THE UNIVERSITY OF CALIFORNIA
SANTA CRUZ
BIASED CLUSTERED SUBSTITUTIONS IN THE HUMAN GENOME:
SEX, GAMBLING AND NON-DARWINIAN EVOLUTION
A thesis submitted in partial satisfaction
of the requirements for the degree of
MASTER OF SCIENCE
in
BIOINFORMATICS
by
Timothy R. Dreszer
December 2006
The Thesis of Timothy R. Dreszer is approved:
Professor David Haussler, Chair
Professor Harry Noller
Professor Joshua Stuart
Lisa C. Sloan, Vice Provost and Dean of Graduate Studies
Copyright © by Timothy R. Dreszer
2006
Table of Contents
Table of Figures
ABSTRACT
Acknowledgements
1.0 Introduction
1.1 Surprising Bias
1.2 Large Scale Bias in G+C?
1.3 Mutation vs. Natural Selection
1.4 Non-Darwinian Selection?
1.5 Placing Bets
1.6 Charting the Goals of this Thesis
2.0 Methods
2.1 Datasets
2.1.1 Preparation of Two Sets of Single Base Pair “Differences”
2.2 Three Lenses to View the Secrets of Bias
2.2.1 The Window Method
2.2.2 “Biased Clustered Substitutions” or How Filtering by Nearest Neighbors Reveals UBCS
2.2.3 Finding Regions of High Density of Bias
2.3 Searching for a Relationship
2.3.1 Clustering
2.3.2 G+C Content
2.3.3 Conservation Score
2.3.4 Telomeric Regions
2.3.5 Recombination Hot Spot Location
2.3.6 Recombination Rates
2.3.7 Transcription Density and Transcription Evidence
2.4 Statistical Tools and Visual Aids
2.4.1 Window Based Statistics
2.4.2 Analyzing UBCS with Zippers and Maps
2.5 Dating the Fusion of Chromosome 2
3.0 Results
3.1 Bias as a Social Disease
3.1.1 Documenting Gang Behavior
3.1.2 Biased Groups are Recruited from Unbiased Individuals
3.2 Focusing on Bias through the Window Lens
3.2.1 Conservative Bias?
3.2.2 Do the Strong Convert the Weak?
3.2.3 Bias at the Hot Spots and on the Edge of Town
3.3 Geographic Distribution of Biased Groups
3.3.1 Near Universal Pattern of Bias Leaves Evidence of a Fusion
3.3.2 Predictable Males and Enigmatic Females
3.3.3 Are Humans More Biased than Chimpanzees?
3.3.4 Biased Without a Cause or Are Boys Troublemakers?
3.3.5 Following the Footprints of Past Recombinations
3.3.6 Seeing Ghosts?
3.4 Humans have been Molded by Bias
3.4.1 Fastest Evolving Region of the Human Genome
3.4.2 Serotonin Receptor Knocked Out in Humans and Chimps
3.4.3 Mistakes Were Made
3.4.4 Biased Clusters Are Transcribed
3.4.5 Currently Bias May be Leading to Thrill Seeking and Disease
4.0 Discussion
4.1 Unexplained Selection
4.2 Just like Us
4.3 Two Chromosomes Come Together on a Date
4.4 The X-Exception: Are Men Really to Blame?
4.5 The Gamble of Male Meiosis
4.6 A Thumb on the Scales of Evolution
5.0 Conclusion
Bibliography
Table of Figures
Figure 1. Biased Gene Conversion
Table 1. Substitution Totals for Human Genome
Figure 2. The Empirical Probability of Bias Due to Substitution Count
Figure 3. “Zipper Plots”: Bias for Clusters of N Substitutions
Figure 4. Bias for Substitutions within N bases
Figure 5. Cluster Bias Heat Map
Figure 6. Weak to Strong Bias in Single Nucleotide Polymorphisms
Figure 7. Bias as a Function of Conservation Score
Figure 8. Empirical Bias as a function of G+C Content
Figure 9. G+C Content Affects Clusters of Substitutions
Figure 10. Conditional Empirical Probability of Bias by Substitution Count
Figure 11. Hot Spots are Slightly More Biased
Figure 12. Bias at Sub-telomeric Regions
Figure 13. Mapping Chromosome 18
Figure 14. “Winged Maps”: Unexpected Bias is Predictable
Figure 15. Exceptions to the Pattern of Unexpected Biased Substitutions
Figure 16. Zipper Plots of Four Chromosomes
Figure 17. Heat Maps of Bias for Four Chromosomes
Figure 18. Biased Clustered Substitutions in the Chimpanzee Genome
Figure 19. UBCS Profile is Similar between Humans and Chimps
Figure 20. The “X Exception” in Chimpanzees
Figure 21. Correlations of Biased Clustering on Chromosome 18
Figure 22. Correlations of UBCS Genome Wide
Figure 23. Mapping UBCS, G+C Content and Recombination Rates on Chr2
Figure 24. Effects of other Factors beyond Male Recombination Rates
Table 2. The Top Regions of Biased Clustered Substitutions in Humans
Table 3. Predicted Changes to HTR5B Due to Point Substitutions
Table 4. Evidence of Transcription Among Biased Regions
Table 5. Most Biased Clusters of SNPs found in the Human Genome
Biased Clustered Substitutions in the Human Genome:
Sex, Gambling and Non-Darwinian Evolution
Timothy R. Dreszer
ABSTRACT
After the discovery that the fastest evolving regions of the human genome show a
“bias” in point substitutions from weak to strong base pairs, this study was
undertaken to characterize patterns of bias, genome wide. Using windows of 100bp,
elevated bias was found for clusters of 5-11 substitutions. Conservation and location
near recombination hotspots were poor predictors of bias, while local G+C content
and sub-telomere location were mildly predictive. Using a nearest neighbor analysis,
bias was shown to occur in clusters of 5 or more substitutions and peak when they are
within 80bp of each other. No biased clustering was found in SNPs, suggesting that
biased substitutions were selected from mutations. Unexpected biased clustered
substitutions (UBCS) were mapped across the human and chimp genomes. This
revealed a universal pattern of elevated bias near the telomeres of all autosomes but
not the sex chromosomes. Human and chimp cousin chromosomes show a
remarkable similarity in the shape and magnitude of their respective UBCS maps,
suggesting a relatively stable force leads to clustered bias. The strongly telomeric
signal may offer an explanation for the evolution of isochores. Additionally,
chromosome 2 shows a UBCS peak mid-chromosome, which maps to the fusion site
of two ancestral chromosomes. This may provide evidence that the fusion occurred as
recently as 0.93 MYA. UBCS is most closely correlated with male recombination
rates, which explains the lack of UBCS signal on chromosome X. Female
recombination rates are unrelated to the residual UBCS signal unexplained by male
recombination. Conservation score and transcription density are also unrelated to
residual UBCS, but local G+C content is. Finally, the most highly biased regions in
the human genome are more likely to be transcribed than chance predicts, and show
specific evidence of UBCS affecting the evolution of humans. Taken together, this
genome wide analysis provides evidence that Biased Gene Conversion is the most
likely cause of the biased clustered substitution pattern found in humans. It is possible
that BGC is a male reproductive strategy that behaves like a neutral selection
pressure, increasing rates of genetic drift and accelerating evolution overall.
Acknowledgements
Katie Pollard has inspired me with her work on fastest evolving regions of the human
genome and has patiently explained to me many things that I was too dense to
comprehend. Daryl Thomas provided the “Chimp Fixed Differences” dataset of
substitutions and the HapMap dataset of SNPs, both aligned to out-group species, and
has also offered patient guidance. Jim Kent has written a large collection of source
code that I was able to call upon for this project, and prepared the “Chimp Simple
Differences” dataset that started it all. I am forever grateful to David Haussler who
has offered me a chance to work in his lab and inspired me to seek the answers to his
never ending stream of questions. Finally, my wife, Lena, supported me in every
way; my son, Taras, was my sounding board from the start; and my daughter,
Natalya, kept me from finishing before the best discoveries were made.
1.0 Introduction
With the publication of the Chimpanzee Genome[1], detailed analysis of the genetic
differences between humans and chimps can begin. This effort will no doubt lead to
the discovery of many of the determinants that make us uniquely human. However,
the availability of a pair of very closely related genomes offers an opportunity to
pursue even more fundamental goals than this. While our species
may legitimately lay claim to uniqueness among earth’s life forms, all available
evidence points to the overwhelming similarity of our genetics. Therefore, a study of
the differences between the human and chimp genomes allows us to glimpse the
forces that shape genomes over time. Thus, the possibility arises to illuminate some
of the mechanisms of evolution and characterize the forces of life. While this thesis
cannot answer such grand questions, it attempts to characterize a little understood
evolutionary pressure, which is distinct from natural selection.
1.1 Surprising Bias
Previous work in this lab was carried out in pursuit of the fastest evolving regions of
the human genome.[2, 3] In the process, a surprising characteristic of the top scoring
regions was uncovered. Single base substitutions were dramatically biased from
weak to strong pairing bases. For instance, in the top four fastest evolving regions,
there were 33 cases of an AT pair being replaced by a GC pair, but only one case of a
GC being replaced by an AT. Thus bases which pair with two hydrogen bonds
(“weak”) were replaced by bases that pair with three (“strong”). This is even more
surprising given that an overall strong to weak mutation bias has been well documented.[4, 5,
6] Since the process of natural selection should “fix” randomly occurring mutations
into the genome based upon the relative fitness of the mutation, this result suggests
that in these particular cases, the stronger base pairs provide greater fitness.
However, since this “bias” from weak to strong is found in all of the fastest evolving
regions, it suggests that there may be more to this story than natural selection acting
upon individual mutations.
1.2 Large Scale Bias in G+C?
For many years it has been clear that the proportion of G+C in a mammalian genome
can vary widely.[7, 8] Certain areas of the warm-blooded vertebrate genome greater
than 300 kilobases have strikingly greater or lesser proportions of G+C than
surrounding areas.[9] These areas have been dubbed “isochores” and have been
discussed widely, though the reason they exist is still under debate. While it is true
that evolutionarily conserved regions tend to have higher G+C content, it is notable
that isochores stretch across conserved and non-conserved regions of the genome. It
is also apparent that the G+C content of genes is correlated with the G+C content of
the isochores within which they are found.[10] Many of the genes in high G+C
isochores have homologs in organisms with no isochores, such as zebrafish; yet the
homologs show significantly less G+C content. This suggests that the motive force
that has generated isochores has applied pressure to the genes that are found within.
On the other hand, one study shows that the G+C content of genes in isochores is
greater than surrounding regions[11], suggesting that isochores have been pulled along
by the G+C requirements of the genes within them. The force that acts to create or
maintain G+C isochores may be the same force that has biased substitutions in recent
human evolution. But is this force natural selection or something distinct?
1.3 Mutation vs. Natural Selection
While the motive force behind biased substitutions in recent human evolution may
not be the same as that for the evolution of isochores, the theories of isochore
formation do parallel possible explanations for recent bias. Three main theories[12]
have arisen to explain the existence of isochores. The first involves variation in
mutation rates[13, 14] in different areas of a genome. This theory suggests that the
initial mutations are not random, at least in the proportion of G+C, and that the bias in
substitution rates really reflects the proportion of mutations available for fixing.
While this theory does not preclude selection occurring, it suggests that where there is
no selection, bias will still arise. If a mutation bias existed, it might easily explain
recent bias in humans. The most obvious contrasting model is that natural selection[9]
has driven the formation of isochores. While hard to disprove, there is not much
evidence to suggest why natural selection would be acting at the level of hundreds
and even thousands of kilobases, and the existence of isochores in relatively
unconserved regions stands in contrast. The most frequently mentioned advantage to
regions of higher or lower G+C content is the relative thermal stability of G+C DNA
and this seems to correlate with isochores being found in warm-blooded vertebrates.
However, there is no correlation between G+C content and optimal growth
temperature in bacteria, which shows that selection based upon thermal stability does
not appear to be happening in bacteria.[15] Nevertheless, while natural selection based
upon relative fitness may not ultimately explain isochore formation, it may indeed
explain the bias found in recent human evolution. The search for the fastest evolving
regions of the human genome, which contained surprising bias, was specifically a
search for highly conserved, and therefore evolutionarily significant regions. Thus,
natural selection might very well be the cause of bias in these cases. Indeed,
ubiquitously expressed “housekeeping” genes have been correlated with higher G+C
content[16], though the strength of this correlation has been disputed.[17] More to the
point, however, is the finding that increased G+C content alone leads directly to
increased gene expression.[18] This increase was not due to increased translation
efficiency, or mRNA stability, but transcription or pre-mRNA processing. Thus,
selection for increased expression may certainly be the cause of the substitution bias
in rapidly evolving regions.
1.4 Non-Darwinian Selection?
A third theory involves Biased Gene Conversion (BGC).[19] BGC is the result of a
DNA repair mechanism which fixes base mismatches and is biased in favor of G and
C.[20, 21] The model of BGC producing biased substitution is that the biased repair
acts upon single nucleotide polymorphisms (SNPs). In recombination, the strands
from two homologous sister chromosomes will form a heteroduplex. Any
mismatched bases in the heteroduplex (i.e. resulting from two alleles of the same
gene) may be “repaired” and bias can be introduced as in Figure 1. BGC occurs in
one individual during a recombination event, but recombination events are known to
occur at hotspots[22] which are shared by individuals within a species. While
recombination hotspots change over time[23, 24], it is entirely plausible that a
significant number of individuals will have recombination events at the same location
or in close proximity within the genome. If these recombination events result in
biased “selection” of SNPs, then Biased Gene Conversion will act as a selection
pressure, distinct from natural selection. As recombination hotspots move over time,
BGC acting upon SNPs may create and maintain G+C rich isochores. Evidence in
favor of BGC being the force that gives rise to isochores can be found in the
correlation of recombination rates and G+C content.[19, 25] Evidence of the correlation
of BGC with recombination has been found in organisms as diverse as humans,
rodents, birds, worms, insects, plants and fungi.[26]
Figure 1. Biased Gene Conversion occurs when recombination brings a section of two sister chromosomes together in a heteroduplex. If the region contains SNP alleles, the mismatched bases will be subject to a “repair”. Since both alleles are valid, there is no “correct” base to be preferred. The repair process favors G-C pairs over A-T pairs. The result is “weak to strong” bias acting as a selection pressure on SNPs. When a cluster of SNPs is affected, the resulting cluster may differ from either parent, creating a novel genotype.
[Figure 1 diagram: “Biased Gene Conversion: Mismatched SNP Repair During Recombination”. Both mismatches are converted to “strong” G-C pairs, replacing “weak” SNPs.]
Again, even if BGC has nothing to do with the creation or maintenance of isochores,
it may be the cause of the biased substitutions found in the fastest evolving regions of
the human genome, and in the bias found in this work. Perhaps the most significant
effect of BGC will occur when clusters of SNPs are converted into biased clusters of
SNPs. Not only will “strong” base pairs be preferred over “weak” pairs, but the
resulting biased cluster of mutations may be a novel genotype not found in either
parent chromosome. Natural selection can be expected to overpower BGC pressure if
a BGC created allele results in significantly less fitness. However, in the absence of
strong selective advantage or disadvantage, BGC would theoretically lead to the
fixation of higher G+C. Stepping back a moment from the issue of isochores or
recent human bias, the implication of BGC, if it does create a force resembling
selection pressure, is that the evolution of species is not merely shaped by natural
selection as Darwin described it. In the absence of positive or negative selection,
directional evolution will still occur. In particular, genetic drift of isolated
populations should be sped along by biased gene conversion.[27] And when BGC
selection is combined with natural selection, the result should be faster evolution, and
in some cases, evolution without selective advantage.
1.5 Placing Bets
The three models of bias give rise to different predictions of SNP and fixed
substitutions. If bias is the result of an underlying mutation rate, then a bias should be
seen in SNPs, but if higher G+C is the result of a selection pressure, there should be a
more pronounced bias in substitutions than in SNPs. Distinguishing between natural
selection and “BGC selection” is a bit trickier. It can be predicted that if isochores
are due to natural selection then substitution bias should be strikingly different
between areas of low and high G+C content. However, the BGC model makes a
similar prediction: that areas of high recombination should show greater substitution
bias, and recombination and G+C content have been shown to be correlated.[19]
Another possible distinction is in the size of an area which undergoes bias. Since
isochores span hundreds of kilobases, the natural selection model would predict that
substitution bias should occur relatively uniformly within an isochore. BGC, on the
other hand, would predict much more localized substitution bias, since the bias should
be due to a repair process occurring during the transient state of the heteroduplex
formed during recombination.[28] Under the BGC model, such local bias could still give
rise to large isochores because of the size of recombination hotspots and the tendency
of the hotspots to shift over time. Thus, the two models of isochore selection should
result in predictably different patterns of bias seen in the recently fixed substitutions
in a species.
It is also possible that recent human bias is not the result of the same process that
has given rise to isochores. Even so, the two models of selection are still competing
explanations of this recent phenomenon. If natural selection were the cause of recent
human bias, then that bias should be more strongly correlated to conserved areas of
the human genome. However, if BGC is the cause, then bias should be more strongly
correlated to recombination rates, and show much less correlation with conserved
areas.
1.6 Charting the Goals of this Thesis
The availability of the high quality sequence of the human genome and the more
recently published chimpanzee genome provides an opportunity to distinguish
between the two models of selection as explanations for recent human bias.
Additionally, the availability of a large set of SNP data for the human genome allows
the evaluation of the mutation bias model as well. Therefore, this effort characterizes
weak to strong biased substitutions across the entire human genome since it diverged
from the chimpanzee line. An explanation for the cause of any bias found has been
sought in underlying mutation rate, natural selection based upon fitness and selection
due to biased gene conversion. An additional goal was to map the location within
the genome of biased substitutions which have occurred in the last six million years
of human evolution. While such a map of bias is interesting in what it reveals about
humans, it has proven to illuminate a fundamental force shaping the evolution of not
only ourselves, but no doubt a vast number of other species as well.
2.0 Methods
All research was undertaken using the genome assemblies of several species, freely
available from the UCSC Genome Browser.[29, 30, 31] Organization, analysis,
calculations, and plotting were performed using either the programming language C
or the statistical package R.[32]
2.1 Datasets
The following datasets were used for this research:
1. Unless otherwise stated, all research was based upon the sequence and locations
found in the May 2004 assembly of the human genome[29, 30, 31] (referred to as
hg17). As will be noted in the text, earlier work used the July 2003 release
(hg16), and more recent work used the March 2006 assembly (hg18).
2. Unless noted otherwise, all research was based upon the alignment to hg17 of the
November 2003 assembly of the chimpanzee (Pan troglodytes) genome[1]
(referred to as pt1). Earlier work made use of a prerelease assembly (pt0), and the
most recent work used the January 2006 assembly (pt2).
3. In order to determine whether a given substitution occurred on the human or
chimp line, an aligned out-group genome was needed. Unless otherwise noted,
the January 2005 pre-release assembly of the Rhesus macaque (Macaca mulatta)
genome[33] (rh0) was used. Earlier work involved both the March 2005 assembly
of the mouse (Mus musculus) genome[34] (mm6) and the June 2003 assembly of
the rat (Rattus norvegicus) genome[35] (rn3) as out-groups, while the most recent
work made use of the January 2006 assembly of macaque (rh1).
4. Analysis of bias in single nucleotide polymorphisms (SNPs) was undertaken
using the International HapMap Project’s October 2005 release of haplotype map
for humans.[36]
5. Recombination hot spots were located using the September 2005 release of
HapMap Phase I data from the International HapMap Project.[36]
6. Recombination rates were provided by the deCODE genetic map[37] based upon
1,257 meiotic events.
7. Designation of genes was taken from the “Known Genes” track of the UCSC
browser and was compiled using protein data from UniProt[38] and mRNA data
from NCBI.[39, 40] Designation of mRNAs was taken from the UCSC browser
“Human mRNA” track and expressed sequence tags from the UCSC
“Human EST” track; sources for both were international public sequence
databases.[41]
2.1.1 Preparation of Two Sets of Single Base Pair “Differences”
The above genome sequences were prepared into two distinct datasets: fixed
substitutions and SNPs. Since many of the exact same analyses were performed
separately upon the two types of single base pair changes in the human genome, the
term “differences” is used to refer generically to both types. Preparation of the fixed
substitutions dataset involved the creation of a set of single nucleotide differences
between human and chimp in regions of high quality chimp sequence (prepared by
Jim Kent) and the inclusion of high quality macaque bases and the lifting of genome
locations to a common assembly (prepared by Daryl Thomas). The SNP dataset was
also combined with corresponding chimp and macaque bases and lifted to hg17
locations (by Daryl Thomas). The two resulting bed files (substitutions and SNPs)
were then converted into pairs of arrays containing location and base change.
Twenty-four pairs of arrays (one for each chromosome) allowed rapid location of
base changes using a binary search algorithm. A base change was reduced to a single
8 bit value which distinguishes the following attributes:
1. “Direction of Change”, or the “from” and “to” pairs for the four base possibilities,
resulting in 12 combinations. By choosing the proper values for the direction, it
was possible to resolve weak to strong and strong to weak with a simple bit mask
(e.g., AtoC:0001, AtoG:0011, TtoC:0101, TtoG:0111; WtoS_MASK = binary 0001).
2. “Ancestry” or the concept of which line the base difference arose in. Obviously
for SNPs, this is always the human line. However, the concept is still needed to
establish the direction of mutation between two alleles.
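The encoding and lookup described above can be sketched as follows. This is a minimal illustration in Python rather than the C used for this work. Only the four weak to strong codes and the value of WtoS_MASK come from the text; the remaining direction codes, the placement of the ancestry bits, and all function names are hypothetical choices made for the example.

```python
from bisect import bisect_left

# Direction codes (low 4 bits). The four weak-to-strong codes and WTOS_MASK
# are taken from the text; every other value here is a hypothetical choice.
WTOS_MASK = 0b0001
DIRECTION = {
    ("A", "C"): 0b0001, ("A", "G"): 0b0011,   # weak -> strong (from the text)
    ("T", "C"): 0b0101, ("T", "G"): 0b0111,
    ("C", "A"): 0b0010, ("G", "A"): 0b0110,   # strong -> weak (assumed)
    ("C", "T"): 0b1010, ("G", "T"): 0b1110,
    ("A", "T"): 0b0100, ("T", "A"): 0b1000,   # weak -> weak (assumed)
    ("C", "G"): 0b1100, ("G", "C"): 0b0000,   # strong -> strong (assumed)
}
STOW_BITS, STOW_VALUE = 0b0011, 0b0010        # assumed test for strong -> weak

# Ancestry stored in two higher bits of the same byte (assumed layout).
ANCESTRY = {"indeterminate": 0b00 << 4, "human": 0b01 << 4, "chimp": 0b10 << 4}


def encode(from_base, to_base, derived_in):
    """Pack direction and ancestry into one 8 bit value."""
    return DIRECTION[(from_base, to_base)] | ANCESTRY[derived_in]


def is_weak_to_strong(code):
    """True when the low bit marks an A/T to G/C change."""
    return bool(code & WTOS_MASK)


def is_strong_to_weak(code):
    """True when the two low bits mark a G/C to A/T change."""
    return (code & STOW_BITS) == STOW_VALUE


def changes_in_range(positions, codes, start, end):
    """Return the codes of all changes with start <= position < end.

    `positions` must be sorted; `codes` is the parallel array of 8 bit values.
    Two binary searches find the slice boundaries.
    """
    lo = bisect_left(positions, start)
    hi = bisect_left(positions, end)
    return codes[lo:hi]
```

Keeping positions and codes in parallel sorted arrays, one pair per chromosome, means any interval can be examined with two binary searches instead of a scan of the whole chromosome.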
In order to determine the likely ancestor base for the fixed substitution arrays, the
aligned out-group base was used. For the fixed substitutions dataset, the line (chimp
or human) that matched the macaque base was designated “ancestral”, and the other
“derived”. If neither line matched the macaque base, ancestry was “indeterminate”. Direction for the fixed
substitution dataset for hg17 was always stored as chimp to human. However, since
only substitutions derived in the human line were used in the hg17 analysis, the
stored direction always runs from ancestor to descendant. More recent work
analyzing bias in the chimp genome used identical methods to create the arrays, but
used chimp locations and reversed the direction of the dataset masks to show
direction from human (hg18) to chimp (pt2).
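The ancestry rule for fixed substitutions can be sketched as a small function. This is an illustration under stated assumptions, not the code used for the thesis: the function name and return format are hypothetical, and the sketch returns the change as ancestral to derived rather than in the stored chimp to human order.

```python
def classify_substitution(human, chimp, macaque):
    """Assign ancestry for one aligned human/chimp single base difference.

    Returns (derived_line, ancestral_base, derived_base). The two bases are
    None when ancestry is indeterminate.
    """
    if human == chimp:
        raise ValueError("not a difference: human and chimp bases agree")
    if macaque == chimp:
        # Chimp matches the out-group, so the change arose on the human line.
        return ("human", chimp, human)
    if macaque == human:
        # Human matches the out-group, so the change arose on the chimp line.
        return ("chimp", human, chimp)
    # The out-group matches neither line.
    return ("indeterminate", None, None)
```

For example, a site that is G in human and A in both chimp and macaque is classified as an A to G change derived in the human line.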
Since this collection of “substitutions” between humans and chimps can be expected
to contain some number of human SNPs which have not been fixed, final processing
of the substitutions dataset involved subtracting any locations found in both the
substitution and SNP datasets. Of 28,937,901 high quality simple differences
between humans and chimps found in hg17, 24,795,278 remained after aligning with
Rhesus macaque and 23,916,284 remained after subtracting SNPs. Of these,
22,784,742 showed unambiguous ancestry, with 10,871,714 derived in humans and
another 11,913,028 derived in the chimp line (see Table 1 of results).
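The filtering counts quoted above can be checked with a few lines of arithmetic. The numbers are those given in the text; the variable names and the derived quantities (differences removed as SNPs, differences with indeterminate ancestry) are introduced here only for illustration.

```python
simple_differences = 28_937_901   # high quality human/chimp differences in hg17
aligned_to_macaque = 24_795_278   # remaining after aligning with Rhesus macaque
after_snp_removal = 23_916_284    # remaining after subtracting SNP locations
derived_human = 10_871_714
derived_chimp = 11_913_028

# The two derived counts should add up to the unambiguous total in the text.
unambiguous = derived_human + derived_chimp
assert unambiguous == 22_784_742

snps_removed = aligned_to_macaque - after_snp_removal
indeterminate = after_snp_removal - unambiguous

print(f"removed as SNPs:         {snps_removed:,}")
print(f"indeterminate ancestry:  {indeterminate:,}")
print(f"retained for analysis:   {unambiguous / simple_differences:.1%}")
print(f"human share of derived:  {derived_human / unambiguous:.1%}")
```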
For the SNP dataset, an out-group was needed to determine the ancestral allele and
therefore the direction of change. Both Pan troglodytes and Rhesus macaque
alignments were used as out-groups. If only one out-group (ape or monkey) was
available, direction was determined if it matched one of the two human alleles, in
which case, that allele was declared ancestral. In the case where two out-groups were
available, they would both have to match the same human allele for direction to be
established. SNPs with indeterminate ancestry (and therefore without established
direction) were eliminated from the dataset. Of 3,874,080 SNPs in the hg17 dataset,
3,424,895 had a direction that could be determined.
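The rule for SNPs, with one or two out-groups, can be sketched in the same way. The function name and the use of None for a missing aligned base are assumptions made for this example.

```python
def ancestral_allele(alleles, chimp=None, macaque=None):
    """Return the ancestral allele of a two-allele SNP, or None if indeterminate.

    `alleles` holds the two human alleles. `chimp` and `macaque` are the
    aligned out-group bases, with None meaning no aligned base is available.
    """
    outgroups = [base for base in (chimp, macaque) if base is not None]
    if not outgroups:
        return None
    first = outgroups[0]
    # With two out-groups, both must point to the same human allele.
    if any(base != first for base in outgroups):
        return None
    # The out-group base must be one of the two human alleles.
    return first if first in alleles else None
```

The direction of change then runs from the ancestral allele to the other human allele, and SNPs for which the function returns None are the ones dropped from the dataset.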
It should be noted that the simple methods of determining ancestry can be expected to
result in false positives in some percentage of cases. First, we rely upon the accuracy
of the sequencing of each species. Next, we rely upon the accuracy of alignment
between species. And finally, the simple methods ignore the possibility that two
mutations at the same site among the three aligned species will result in erroneous
classification of a human derived substitution. However, the majority of this work
attempts to characterize a genome wide phenomenon and relies upon thousands and
even millions of differences in order to reveal a pattern. The handful of false
positives should be overwhelmed by true positives. Additionally, there is no reason
to expect that inaccurate data would bias the results in a particular direction; rather, it
could be expected to dilute any pattern to be revealed. While this conclusion seems
reasonable when examining patterns in humans genome wide, caution should be
taken in drawing conclusions about two types of analysis. First, when specific
regions are examined, fixed substitutions and SNPs should be recharacterized in order
to establish confidence. Second, when examining the fixed substitutions data from
the perspective of chimp evolution, the lower confidence in the chimpanzee sequence
should be considered. If a base pair is AT in human and macaque, but GC in chimp,
then the difference might be due to either a fixed substitution or a sequencing error in
chimps. For this reason, the majority of work concentrates upon characterizing
patterns found in the human genome.
2.2 Three Lenses to View the Secrets of Bias
Three separate methods were used to attempt to view the characteristics of bias across
the human genome. The first two methods attempt to characterize bias in terms of its
effects upon roughly similar objects across the whole genome, while the third method
attempts to locate the regions most affected by bias. Though methods and results are
presented as if the three analyses were performed sequentially, in reality each method
was altered somewhat based upon the results found in the other two.
While it should be relatively easy to count the biased changes in a set of 10 million
fixed substitutions, such an analysis would miss any patterns that involve multiple
substitutions located in close proximity. It is also easy to generate a set of clusters of
substitutions, but those clusters can be expected to have a range of sizes and densities
(e.g., 12 substitutions within 86bp vs. 4 substitutions within 293bp). In order to
characterize bias in clusters in terms of a dataset of statistically similar objects, two
methods were used: windowing and filtering by nearest neighbors.
2.2.1 The Window Method
A simple windowing method was used in order to characterize bias systematically
across the whole genome. This method, which is capable of illuminating the
clustering of differences, does not assume that bias is related to clustering at all. The
entire genome was broken into windows of fixed length and fixed sliding or stepping
increment. The advantage of overlapping windows is that clustered differences are
less likely to go unrecognized due to splitting. Since this analysis used windows that
overlapped by half, a distinct disadvantage is that the vast majority of substitutions or
SNPs are counted twice. The result is that low density window counts are
approximately doubled while high density window counts may be somewhat less than
doubled. Nevertheless, clusters of differences should rarely escape detection. All
windows without a single base change were dropped from the analysis. For all results
discussed in this document (unless otherwise noted), a window of 100 stepping 50
was used to cover each of the chromosomes. An initial analysis of windows of
300bp stepping 150 was performed, based upon the approximate mean size of a gap
subject to BGC due to a recombination event.[28] However, this initial analysis
revealed a strong relationship between bias and density of substitutions. Further
analysis, discussed below, led to choosing windows of 100bp stepping 50 as a more
appropriate lens. Using 100/50 windowing should result in 61,535,590 possible
windows of the human genome (hg17). For the fixed substitutions dataset,
16,633,481 windows were discovered with at least one substitution in the human line
(Table 1 of results). Analysis of SNPs used windows of 300bp, stepping 150 for
20,511,865 possible windows, 1,900,453 of which contained at least one SNP of clear
ancestry. The actual stored window data consists of a location, a size, and the raw
counts of the 12 possible base changes. Additional fields for current G+C count,
conservation score, and whether the window is in a recombination hot spot or is
telomeric are included as described below.
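The half-overlapping window scheme can be illustrated with a short sketch. It assumes 0-based positions, windows of the form [start, start + size), and a simple "ancestral>derived" key for each of the 12 base changes; chromosome ends are ignored. None of these details are taken from the original implementation.

```python
from collections import Counter, defaultdict

def window_starts(pos, size=100, step=50):
    """Starts of every window [start, start + size) that contains pos."""
    first = max(0, ((pos - size) // step + 1) * step)
    return range(first, pos + 1, step)

def tally_windows(substitutions, size=100, step=50):
    """substitutions: iterable of (position, ancestral_base, derived_base).

    Returns {window_start: Counter of base-change types}. Windows without
    a single base change never appear, mirroring the dropped windows.
    """
    windows = defaultdict(Counter)
    for pos, anc, der in substitutions:
        for start in window_starts(pos, size, step):
            windows[start][anc + ">" + der] += 1
    return windows
```

With a step of half the window size, most positions fall in two windows, which is the double counting noted above.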
2.2.2 “Biased Clustered Substitutions”
or How Filtering by Nearest Neighbors Reveals UBCS
While the windowing method allows examining clustered and non-clustered
substitutions in a statistically neutral manner, it fails to adequately capture all
substitutions that might belong to a single cluster. Additionally, overlapping
windows, as explained above, overestimate low density windows as compared with
windows containing clusters of differences. Therefore, a second method of viewing
bias across the genome was developed. Having demonstrated with the windowing
method that bias is associated with clustering, this method targets clustering as the
most recognizable dimension of bias. Each individual substitution (or SNP) was
considered as to whether it belongs to a cluster based upon its nearest neighbors. This
allowed for the systematic description of bias for clusters of 2 to 10 differences
within 20 to 600 base pairs. It should be clear, however, that this characterization is
fundamentally of individual differences and not clusters. For example, each
substitution was considered as belonging to a cluster of 7 substitutions by looking at
its absolute nearest 6 neighboring substitutions. If a substitution qualifies as
belonging to a cluster of 7 within 120 bases, its six neighbors which make up that
cluster may not qualify as part of the same cluster! For instance, a substitution at one
edge of that “7 in 120” cluster may actually qualify as belonging to a cluster of 7
within 80 bases, while a substitution near the other edge may be seen as belonging to an
entirely different cluster of 7. Thus, one substitution may find itself in a highly
biased cluster, while its very nearest neighbor is in a less dense cluster that is not
biased at all! However, on the whole, this method has proven beneficial in
characterizing the magnitude of bias as a function of both the number in a cluster and
width of a cluster. It was this analysis that led to the readjustment of the window size
from 300 to 100 bases in the window method described above.
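The way each substitution is assigned a cluster span from its own nearest neighbors can be sketched as below. The function name, the tie-breaking rule (prefer the left neighbor) and the definition of span as last position minus first position are assumptions of this sketch.

```python
def nearest_neighbor_span(positions, i, k):
    """Span covered by positions[i] and its k-1 nearest neighbors.

    positions must be sorted in ascending order. Returns None when fewer
    than k positions exist. Ties go to the left neighbor (an assumption).
    """
    if len(positions) < k:
        return None
    lo = hi = i
    while hi - lo + 1 < k:
        left = positions[i] - positions[lo - 1] if lo > 0 else None
        right = (positions[hi + 1] - positions[i]
                 if hi + 1 < len(positions) else None)
        if right is None or (left is not None and left <= right):
            lo -= 1          # the next nearest neighbor is on the left
        else:
            hi += 1          # the next nearest neighbor is on the right
    return positions[hi] - positions[lo]

# Hypothetical positions: the substitution at 40 sits in a tight group,
# while its neighbor at 200 must reach back across a long gap.
positions = [0, 10, 20, 30, 40, 200, 210]
tight = nearest_neighbor_span(positions, 4, 5)
loose = nearest_neighbor_span(positions, 5, 5)
```

Adjacent substitutions can therefore receive very different spans, which is the edge effect described in the text.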
A second benefit of this view of substitutions and SNPs is that mapping of the
locations of bias across the genome could be undertaken as a simple histogram. By
defining some minimum threshold required to be considered a member of a cluster
and to be considered a member of a biased cluster, the dataset could be “filtered” into
subsets: “clustered substitutions” and “biased clustered substitutions.” In this
analysis, a cluster was taken to be “at least 5 differences within 300 bases” while a
biased cluster was considered to be “a cluster with at least 80% Weak to Strong”. To
be clear, a “biased clustered substitution” (BCS) must be a part of a cluster, and the
cluster itself must be biased, though the substitution itself need not be. For example, if a
substitution’s nearest 6 neighbors are within 260 bases and that set of 7 substitutions
contains 6 weak to strong changes, then it is clearly a “biased clustered substitution”. By contrast,
another substitution does not qualify as belonging to a cluster if its nearest 4
neighbors span 306 bases, and it does not qualify as belonging to a biased cluster if its
nearest 6 neighbors are within 280 bases but only 5 of those 7 are weak to strong
changes. Using these definitions, it was possible to filter the substitutions dataset and
generate a set of substitutions that are clustered and another that are members of a
biased cluster.
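The filtering thresholds can be expressed as a small classification sketch. It assumes the span and the weak to strong count of a substitution's nearest-neighbor set have already been computed; the function name and return labels are hypothetical.

```python
def classify(span, n_in_set, n_weak_to_strong,
             min_cluster=5, max_span=300, min_bias=0.80):
    """Classify one substitution from the set formed with its neighbors.

    span: bases covered by the substitution and its nearest neighbors.
    n_in_set: substitutions in that set (the substitution included).
    n_weak_to_strong: how many of them are weak to strong changes.
    """
    clustered = n_in_set >= min_cluster and span <= max_span
    biased = clustered and n_weak_to_strong / n_in_set >= min_bias
    if biased:
        return "biased clustered"
    return "clustered" if clustered else "unclustered"
```

Applied to the three cases in the text, a set of 7 within 260 bases with 6 weak to strong changes is biased clustered, a set of 5 spanning 306 bases is unclustered, and a set of 7 within 280 bases with only 5 weak to strong changes is clustered but not biased.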
While the definition of a biased cluster of differences proves useful in illustrating the
location of weak to strong bias, it raises the question of how many biased clustered
substitutions can be expected by the null model. Expected biased clustered
substitutions (or SNPs) can be considered to be a function of both the probability of
clustering and the probability of weak to strong substitutions. An estimate of the null
model frequency of clustering is generated by considering clustering to be a Poisson
process which starts at each difference and uses the rate of substitutions for lambda to
calculate the probability of at least 5 substitutions within 300 bases. Likewise, an
estimate for the null model frequency of biased clustered substitutions could be
generated by starting with the estimated frequency of clustering and applying a
binomial probability that the differences will be at least 80% weak to strong.
However, the resulting estimate of the expected biased clustered substitution count
would mix both the phenomenon of clustering and the phenomenon of bias into the
equation, whereas this analysis attempts to characterize forces associated with weak
to strong bias independent of forces that lead to clustering of substitutions. Indeed, a
theoretical cause of biased clusters, BGC, should act upon existing clusters of SNPs.
For this reason, the estimate of expected biased clustered substitutions is here
generated using actual, not expected clusters. For example, given that a certain
region of the human genome has 200 substitutions in clusters, and given that 43% of
the substitutions in that region are weak to strong, the expected frequency of biased
clustered substitutions in that region is 200 times the cumulative binomial probability
that at least 4 of 5 substitutions will be biased (about 0.112). We can expect 22.4 biased clustered
substitutions in this region, according to the null model. Using this estimate, we can
calculate the amount of “unexpected biased clustered substitutions” (UBCS) as actual
BCS, minus expected BCS. While BCS is an actual count of substitutions, UBCS is a
calculated number which may be either positive or negative and would be zero in the
null model. A large scale analysis of the genome was undertaken by mapping the
distribution of substitutions (or SNPs), biased substitutions, clustered substitutions,
biased clustered substitutions and unexpected biased clustered substitutions across
each of the 24 chromosomes. Analysis of UBCS makes up the majority of this
research, and the resulting discoveries are most revealing.
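The worked example of 200 clustered substitutions at 43% weak to strong can be reproduced with a short sketch. The function names are illustrative only; the arithmetic follows the cumulative binomial described in the text.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def expected_bcs(n_clustered, p_weak_to_strong, k=4, n=5):
    """Expected biased clustered substitutions under the null model."""
    return n_clustered * binom_tail(k, n, p_weak_to_strong)

def ubcs(actual_bcs, n_clustered, p_weak_to_strong):
    """Unexpected BCS: actual count minus the null-model expectation."""
    return actual_bcs - expected_bcs(n_clustered, p_weak_to_strong)

expected = expected_bcs(200, 0.43)   # rounds to 22.4, as in the text
```

A region with more observed BCS than this expectation yields a positive UBCS, and a region with fewer yields a negative one.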
2.2.3 Finding Regions of High Density of Bias
The third method of characterizing bias attempts to find the regions in the human
genome with the most significant changes due to clustering of biased substitutions. It
involves generating a list of the longest clusters of substitutions within the human
genome which contain a minimum density of differences, then ranking the list to find
the most biased clusters. This analysis, originally conducted on hg16 but updated to
hg18, looks for clusters which contain at least 6 differences derived in humans with a
density of no less than one difference in 32 bases. These initial high density clusters
were extended out as long as the region maintained a density of 1 difference per 32
bases with no barren stretch longer than 96 bases. The patches were then carved
down to maximize a score of bias for each cluster in the list. The score or “P value”
used to rank these clusters was the cumulative binomial probability of the number of biased
substitutions among all substitutions in a cluster. By using a binomial instead of a Poisson
score, larger clusters are favored over shorter but denser clusters.
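A minimal sketch of such a ranking score follows. It assumes the score is the upper binomial tail for the observed count of weak to strong changes and borrows a background probability of 0.43 from the earlier example; both choices are assumptions, and the seeding, extension and trimming steps are not shown.

```python
from math import comb

def bias_score(n_biased, n_total, p_background):
    """Cumulative binomial probability of at least n_biased biased changes."""
    return sum(comb(n_total, j) * p_background**j
               * (1 - p_background)**(n_total - j)
               for j in range(n_biased, n_total + 1))

def rank_clusters(clusters, p_background=0.43):
    """clusters: list of (name, n_biased, n_total); most biased first."""
    return sorted(clusters,
                  key=lambda c: bias_score(c[1], c[2], p_background))

# Two hypothetical clusters with the same 80% proportion of biased changes.
ranked = rank_clusters([("short", 8, 10), ("long", 16, 20)])
```

At an equal proportion of biased changes, the larger cluster receives the smaller score and ranks first, consistent with the preference for larger clusters noted above.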
While the methods described here for finding high density biased regions are perhaps
overly convoluted and non-intuitive, the purpose of this exercise was to locate some
of the regions of the human genome which have been most altered by the force that
has created biased clusters genome wide. In this, it has succeeded with interesting
results, as will be shown. It should be noted that this list was generated agnostic to
any other factor beyond density of biased differences. No measurement of
conservation was used to generate, filter or score the list of most biased regions in the
human genome.
However, after examining the top scoring regions, the list was filtered to remove self
alignments and repeats. While there can be some confidence in the quality of point
substitutions (or SNPs) themselves, the methods used to identify them rely upon
sequence alignments of three species. Any alignment errors should be overwhelmed
by successful alignments in genome-wide analysis, but may be more pernicious when
analyzing individual cases. Therefore, any region of bias with more than 10
references in the UCSC “self-alignment” track[42] or greater than 90% self alignment
score was eliminated. Self-alignments represent duplications in the human genome
and can result in cross-species misalignments. That said, a smaller number of self-
alignments might actually be expected in a family of closely related genes, and
therefore self alignments should not be eliminated entirely from the list. Clusters
which contained more than 50% repeat coverage in the RepeatMasker[43] track of the
UCSC Human Genome Browser were also eliminated. While the unfiltered list of top
scoring regions does identify interesting aspects of the most biased regions, the
filtered list does sharpen the focus further.
2.3 Searching for a Relationship
In an attempt to characterize weak to strong substitution bias, it is desirable to
determine if there is any relationship between patterns of bias and other key factors.
2.3.1 Clustering
As already mentioned, the clearest relationship between weak to strong bias and
another factor is the clustering of substitutions. All three methods of viewing the data
described above confirmed the importance of clustering. The search for regions of
high density of bias was predicated upon this relationship and filtering by nearest
neighbor was designed to most fully characterize this relationship.
2.3.2 G+C Content
Because the biased substitutions analyzed here are changing the G+C content of the
local sequence, it is only natural to ask if the G+C content of the local sequence is
influencing the accumulation of bias. Several factors might influence such a
relationship. First, the existence of isochores and the mystery of their origin raise the
question of whether they are currently increasing in bias. Second, any other force for
selection of G+C may be acting over time, and may be revealed by a relationship
between high G+C and new biased substitutions. However, even if the cause of
biased substitutions in humans has nothing to do with isochore or other selection
forces, it should still be expected that the tendency of weak to strong changes will be
affected by background G+C content. That is, if a region of DNA is already highly
G+C enriched, and several “random” substitution events occur, then the null model
would predict that there should be more strong to weak events than the opposite,
simply because there are more strong base pairs available to be changed to weak
ones. Therefore, background G+C content was considered in both the windowed
analysis and in mapping of clustered substitutions.
Each window was updated with the amount of G+C found in the human sequence.
When windows of 300bp were used, the G+C content was simply a measure of
those 300 bases. However, when windows of 100bp were used, the G+C content was a
measure of the amount of G+C found in a window of 1000bp with the 100bp window
at its center. Windows in a bed file format were fed to Daryl Thomas’ hgGcPercent
program to generate a raw G+C count of bases which was then used to update the
original windows dataset. However, analysis was done with ancestral G+C count,
rather than the current count. The calculation of ancestral G+C count is simply the
current G+C count plus Strong to Weak changes and minus Weak to Strong changes.
The advantage of this calculation of ancestral G+C is simplicity. The disadvantage is
that the base changes for which no out-group species was available or direction could
not be determined are counted at their current G+C. While some distortion can be
expected from this method, the amount of distortion should be of little significance.
Average distortion can be expected to be less than 0.35% for substitutions and much
smaller still for SNPs. Additionally, there is no reason to believe the distortion would
be systematically biased to either G+C or A+T.
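The ancestral G+C calculation and the 1000bp context used for 100bp windows can be written out as a brief sketch. The function names, the 0-based half-open coordinates and the clipping at the start of a chromosome are assumptions of the example.

```python
def ancestral_gc(current_gc, strong_to_weak, weak_to_strong):
    """Ancestral G+C count = current G+C + S to W changes - W to S changes."""
    return current_gc + strong_to_weak - weak_to_strong

def context_interval(window_start, window_size=100, context=1000):
    """Interval of `context` bases with the window at its center."""
    pad = (context - window_size) // 2
    start = max(0, window_start - pad)   # clipped at the chromosome start
    return start, window_start + window_size + pad
```

For instance, a region with a current G+C count of 40 that has undergone 3 strong to weak and 1 weak to strong changes is assigned an ancestral G+C count of 42.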
For calculating empirical probabilities of weak to strong using the window method,
ancestral G+C content allows a more sophisticated analysis of the relationship
between bias and G+C. The simple empirical probability of bias or P(bias) can be
seen as the number of weak to strong substitutions in a category divided by the total
number of substitutions in that category. This value can be adjusted for the expected
changes due simply to background G+C. Thus P(bias | ancestral G+C) is the count of
weak to strong differences divided by all ancestral weak bases.
For analysis based upon Biased Clustered Substitutions, a simple percentage of G+C
in a given bin size was used. Analysis by bins from 10,000 to 1 million base pairs
was used.
2.3.3 Conservation Score
Bias found in fixed substitutions may be the result of Darwinian selection. If this
were the case, biased substitutions might be more likely in regions of the genome
which are more highly conserved. In order to determine this, some measure of
evolutionary conservation is required. Conservation scoring was done using the
phastCons methods developed by Adam Siepel.[44] In particular, the original
“conservation” track of hg17 was used which was generated from the alignment of 8
species: human (hg17), chimp (pt1), mouse (mm5), rat (rn3), dog (cf1), chicken
(gg2), fugu (fr1), and zebrafish (dr1). This method generates a score for each aligned
base between 0 and 1. For biased clustered substitution analysis, the average scores
of all bases in bins from 10,000 to 1 million base pairs across each chromosome were
used as a measure of conservation.
In order for a window to receive a conservation score, at least 80% of the bases
covered by that window must have had a score. Bases without a phastCons score can
be considered as having very low conservation, since alignment was not possible for
these sites. However, rather than scoring these bases at zero conservation, they were
excluded from the analysis. The window score was the average of the individual base
scores. In particular, it was the sum of all scores divided by the count of bases with a
score, not the count of bases in the window. Since unscored bases are likely to be
unconserved, the exclusion of these bases can be expected to raise the conservation
score of windows with less than 100% coverage. It should be recognized that since
the window conservation score is an average of the scored bases it contains, the larger
the window size the more muddied this score becomes. All windows failing to meet
the 80% threshold received a zero score and were excluded from the conservation
analysis. Of 16,633,481 fixed substitution windows, 15,245,997 received a
conservation score, while only 1,387,484 (8.34%) failed to reach the 80% threshold
of bases with phastCons scores.
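The window scoring rule can be sketched as follows. The representation of unscored bases as None, and the return of None for windows below the coverage threshold (rather than a stored zero), are assumptions made for illustration.

```python
def window_conservation(scores, min_coverage=0.80):
    """Mean per-base conservation score of a window, or None if excluded.

    scores: one entry per base in the window; None marks a base without
    a phastCons score.
    """
    scored = [s for s in scores if s is not None]
    if len(scored) < min_coverage * len(scores):
        return None                     # below the coverage threshold
    # Divide by the count of scored bases, not by the window size.
    return sum(scored) / len(scored)
```

Because unscored bases are left out of the denominator, a window with 85 scored bases at 0.5 is scored 0.5 rather than 0.425.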
In the course of this analysis, several different sets of species were tried for the
purposes of assigning conservation scores. The original plan was simply to use
human, chimp, mouse and rat. However, while this set of species effectively
recognized far more of the human genome as “conserved”, it also resulted in far more
false positives, and an inability to easily distinguish the most dramatically conserved
areas. If you imagine using only human and chimp, it is clear that well over 90% of
the genome would appear conserved. By using the 8 vertebrate species listed above,
some depth to the conservation data can be obtained. While it is not possible to
ensure no evolutionary pressure is selecting changes in the “unconserved” windows,
the score assigned to each window is a direct measure of the probability that the bases
within that window are conserved.
2.3.4 Telomeric Regions
A window or a high density patch was considered telomeric (or sub-telomeric) if it
overlapped the first or last chromosomal band by even a single base. Of the
16,633,481 windows of fixed substitutions, 1,030,290 (6.19%) were found to be in or
overlapping a chromosomal telomere.
2.3.5 Recombination Hot Spot Location
A “bed” file covering recombination hot spots[36] was used to determine whether
windows fell in hot spots. If 50% or more of a window overlapped a hotspot, this
window was considered as belonging to a hot spot. For fixed substitutions, 1,535,478
of 16,633,481 windows, or 9.23% were hot. For SNPs, 182,992 of 1,900,453 or
9.63% were hot. Additionally, the distance of a window to its nearest hot spot was
analyzed. For the Biased Clustered Substitution analysis, the number of bases
belonging to a hot spot within a particular bin was used to measure the association
between hot spots and biased clustering.
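The overlap rules for hot spots and telomeric bands can be sketched with simple interval arithmetic. Intervals are assumed to be 0-based and half-open, as in a BED file, and summing coverage across several non-overlapping hot spots is an assumption of this sketch; all coordinates in any usage are hypothetical.

```python
def overlap_bases(a_start, a_end, b_start, b_end):
    """Bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def in_hot_spot(win_start, win_end, hot_spots, min_fraction=0.5):
    """True if at least min_fraction of the window lies in hot spots."""
    covered = sum(overlap_bases(win_start, win_end, s, e)
                  for s, e in hot_spots)
    return covered >= min_fraction * (win_end - win_start)

def is_telomeric(win_start, win_end, first_band, last_band):
    """True if the window overlaps the first or last band by any base."""
    return any(overlap_bases(win_start, win_end, s, e) > 0
               for s, e in (first_band, last_band))
```

The two tests differ deliberately: a hot spot requires half of the window, while a single overlapping base is enough for a window to count as telomeric.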
2.3.6 Recombination Rates
For the biased clustered substitutions view, recombination rate data from deCODE[37]
was used. Recombination rates were available for males and females separately as
well as the sex averaged rate. The data came in the form of a rate averaged across 1
million bp segments of the genome. While this prevented fine detail correlations
across the whole genome, the data proved revealing even in 1mbp granularity.
2.3.7 Transcription Density and Transcription Evidence
Transcription Density was a measure of the number of bases in a region which are
found in one of the following UCSC browser tracks: Known Gene[38, 39, 40], Human
mRNAs[41] or Human EST.[41] It should be clear that this analysis does not cover the
rate of transcription, but only whether some evidence exists in humans that
transcription occurs. Transcription Density was used in the Biased Clustered
Substitution analysis as a base count in bins of 10,000bp to 1mbp. Additionally,
the top scoring regions of bias were examined as to whether they showed evidence of
transcription. For this, a descending hierarchy of transcription evidence was sought
as follows: known exons, known genes, human mRNAs, human ESTs, non-human
mRNAs and non-human ESTs.
2.4 Statistical Tools and Visual Aids
The windowing method was analyzed by basic empirical probabilities, which were
plotted across a range of factors. The Biased Clustered Substitutions method
involved a large number of plots to show location of features within the genome, as
well as the correlations of certain factors across a number of bin sizes. Close to 1500
graphics were generated in order to characterize bias in the human genome. The
graphical package R was used to generate all plots.
2.4.1 Window Based Statistics
All windows with at least one base change were used for gathering statistics. The
following empirical probabilities were of interest:
1. P(W to S | change): count of Weak in ancestor and Strong in descendant divided
by all base changes in a window. Likewise P(S to W | change), P(S to S | change)
and P(W to W | change) were calculated. Also referred to as P(Bias).
2. P(S | aW and change): count of Weak in ancestor and Strong in descendant
divided by all base changes which were weak in ancestor. Likewise P(W | aW
and change), P(W | aS and change) and P(S | aS and change) were calculated.
Also referred to as P(Bias | anc., change).
3. Normalized P(S | aW): count of Weak in ancestor and Strong in descendant
divided by all weak in ancestor, whether mutated or not. While P(W | aS) was also
calculated, P(W | aW) and P(S | aS), which do not intrinsically
involve a base change, were not. Given that the calculated probability for the
window is dependent upon the number of base changes in that window, this
statistic is normalized by dividing a window’s probability by the number of
changes in that window. Though probabilities are calculated by window, the
resulting empirical probability is that a given base change in that window will be
biased, rather than the probability that a biased change will occur in that window.
Also referred to as P(Bias | anc.).
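The three weak to strong statistics listed above can be computed for a single window as in the following sketch. The argument names are hypothetical, and the window is assumed to contain at least one base change and at least one ancestral weak base.

```python
def window_probabilities(ws, sw, ww, ss, ancestral_weak_bases):
    """Empirical probabilities for one window.

    ws, sw, ww, ss: counts of weak to strong, strong to weak, weak to weak
    and strong to strong changes. ancestral_weak_bases: all weak bases in
    the inferred ancestral window, whether changed or not.
    """
    changes = ws + sw + ww + ss
    p_bias = ws / changes                              # P(W to S | change)
    p_bias_anc_change = (ws / (ws + ww)
                         if ws + ww else None)         # P(S | aW and change)
    p_s_given_aw = ws / ancestral_weak_bases           # P(S | aW)
    normalized = p_s_given_aw / changes                # Normalized P(S | aW)
    return p_bias, p_bias_anc_change, normalized
```

For a window with 3 weak to strong, 1 strong to weak and 1 weak to weak change among 60 ancestral weak bases, the three values are 0.6, 0.75 and approximately 0.01.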
The probabilities were calculated for a number of different “categories”:
1. Entire dataset: For all windows in the human genome with at least 1 base change,
the 12 different base change types were summed and the empirical probabilities
were calculated. The empirical probabilities were determined for individual
windows, then the mean and standard deviation were generated for all windows.
That is, the resulting statistics summarize per-window probabilities, rather than a single
probability computed from summed counts. The calculated standard deviation allowed for representing
standard error bars in plots.
2. Windowed substitution count. All windows with a single base change were
grouped separately from all windows with 2 changes all the way up to the
maximum number of changes per window. For 300/150 windows, the maximum
number of fixed substitutions in a single window was 26, while only 10 SNPs
were found in a single window. For windows of 100bp, 12 was the maximum
number of fixed substitutions found.
3. Ancestral G+C percent: All windows were divided into 10 bins for the percentage
of ancestral G+C found. That is, those windows with less than or equal to 10%
were grouped separately from windows with greater than 10% but less than or
equal to 20%, continuing through those windows with greater than 90% ancestral
G+C content.
4. Conservation Score: All windows for which a conservation score could be
calculated were divided into one of 5 bins according to conservation score.
Windows were considered to be low conservation if their average conservation
score was less than 0.2.
5. Hot Spots: All windows which overlap hot spots were summed separately from all
windows not overlapping hot spots.
6. Telomeric: All windows which overlap telomeres were summed separately from
all windows not overlapping chromosomal telomeres.
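Two details of the categories above lend themselves to a short sketch: the assignment of windows to ancestral G+C bins, and the difference between averaging per-window probabilities and pooling counts. The function names are illustrative, and taking the standard error as the standard deviation divided by the square root of the number of windows is an assumption about how the error bars were derived.

```python
from math import ceil, sqrt
from statistics import mean, stdev

def gc_bin(percent):
    """Bin 0 holds <=10%, bin 1 holds >10% to <=20%, ..., bin 9 holds >90%."""
    return min(9, max(0, ceil(percent / 10) - 1))

def summarize(per_window_probs):
    """Mean and standard error of per-window empirical probabilities."""
    m = mean(per_window_probs)
    se = stdev(per_window_probs) / sqrt(len(per_window_probs))
    return m, se

# Two hypothetical windows given as (weak_to_strong, total_changes).
windows = [(1, 1), (1, 4)]
mean_of_probs, std_err = summarize([ws / n for ws, n in windows])
pooled = sum(ws for ws, n in windows) / sum(n for ws, n in windows)
```

In this toy case the mean of the per-window probabilities is 0.625 while the pooled proportion is 0.4, which shows why the two summaries should not be confused.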
Additional categories were analyzed as combinations of the primary categories. For
example, the G+C categories were further broken into base change count categories.
Window based plots were generated genome wide for fixed substitutions derived in
the human line for windows of 300 and windows of 100. Additionally, most of the
same plots were generated for human SNPs for windows of 300 base pairs. Plots of
empirical P(Bias) and Normalized P(Bias | Ancestral) with standard error bars were
generated for the number of differences per Window, G+C Content, Conservation
Score, Hot Spots and Telomeric Location. In addition, the Average Distance to a Hot
Spot, Average Proximity to a Telomere, Average G+C Content and Average
Conservation Score were all plotted vs. Window Substitution Count. A total of 82
plots were made for this analysis.
2.4.2 Analyzing UBCS with Zippers and Maps
Biased Clustered Substitution Analysis involved plots for each of the 24 human
chromosomes as well as the Whole Genome. All plots were made for fixed
substitutions derived on the human line but many were repeated for human SNPs and
for Chimp fixed substitutions.
1. Plots with error bars for empirical P(bias) measured for clusters of 2 through 10
substitutions within 20 through 600 bases were made. The set of nine plots, taken
together were dubbed “zipper plots” for reasons which will become obvious when
the results are examined. While these nine plots showed the effect of substitution
count on bias, an additional 15 plots for the entire genome showed the effects of
cluster span in base pairs. Heat Maps were made of the same information in order
to condense the three dimensional information of 9 plots into one graphic. It was
also useful to generate “normalized” heat maps, which centered the coloring on
the average chromosome bias. These 855 plots served to fully characterize the
dimensions of bias associated with clustering of substitutions and SNPs in
humans (hg17-pt1) and for substitutions alone in chimps (hg18-pt2).
2. Once a dataset of biased clustered substitutions was generated based upon the
definition of 5 differences within 300bp with at least 80% weak to strong changes,
mapping of locations was done by histogram for each of the 24 chromosomes.
Maps of Substitutions, Weak to Strong Substitutions, Clustered Substitutions,
Biased Clustered Substitutions and Unexpected Biased Clustered Substitutions
were generated using 1 million base pair (mbp) bins. In order to simplify the
relatively noisy histogram maps of unexpected bias, a smoothing function was
applied. The loess function of R was used to smooth by “least squares” across a
span of 25 bins, or 25mbp. The smoothed curve was used to generate a 95%
confidence interval using +/- 1.96 standard deviations. While this method
assumes a normal distribution of data, unexpected biased clustered substitutions
are not normally distributed. However, the null model would predict a normal
distribution of actual minus expected bias. Unexpected Biased Clustered
Substitutions were mapped together with G+C Content, male and female
recombination rates, and transcription density. Recombination rates and
unexpected bias were additionally plotted using the smoothing function for both
(this time with a span of 15mbp). Maps of individual chromosomes were joined
sequentially into genome wide maps of the various relationships. In all, 500
chromosome maps were generated for human and chimp substitutions, and human
SNPs.
3. Correlations were generated between BCS, UBCS and Smoothed UBCS; and Hot
Spots; G+C Content; Male, Female and Sex-Averaged Recombination Rates;
conservation score and transcription density. Pearson’s Correlation Coefficient
was generated for bins of size 10,000 through 1 million base pairs, with the many
results plotted in a single graph, resulting in 75 correlation plots for human fixed
substitutions.
4. Finally, an effort was made to determine whether a second factor might explain
the UBCS signal in combination with a first. After a linear relationship was
established between points on a scatter plot of UBCS and male recombination
rate, the differences between the actual points and the linear approximation can be
understood as the residual signal left unexplained by male recombination rate’s
relationship with UBCS. Those “residuals” can then be plotted with a second
factor in order to determine if an additional relationship is involved in producing
the full UBCS signal. The residual signal of UBCS, unexplained by male
recombination rate, was plotted with female recombination rate, G+C content,
conservation score and transcription density.
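The correlation and residual steps described in items 3 and 4 can be sketched in plain Python. The original analysis was carried out in R; this sketch uses an ordinary least-squares line and Pearson's coefficient, and every value in the example lists is hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residuals(x, y):
    """Residuals of y after fitting a least-squares line on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Hypothetical per-bin values, for illustration only.
male_rate = [0.5, 1.0, 1.5, 2.0, 2.5]
ubcs_bins = [1.0, 2.5, 2.5, 4.5, 4.5]
gc_content = [0.38, 0.45, 0.39, 0.47, 0.40]

left_over = residuals(male_rate, ubcs_bins)   # unexplained by male rate
r_second = pearson(left_over, gc_content)     # relationship with a 2nd factor
```

A residual correlation near zero would suggest that the second factor adds little beyond the first, while a large value would point to an additional relationship.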
2.5 Dating the Fusion of Chromosome 2
Clearly the UBCS signal near telomeres dominates the rest of the chromosome, as
seen in Figure 15. A reasonable assumption is that the internal peak of chromosome
2 built up while the region was sub-telomeric in the unfused chromosomes, and
stopped accumulating soon after fusion. The chimp and human maps of cousin
chromosomes proved remarkably similar in the shape and relative amplitude of
telomere peaks (Figure 19). This allows using the ratio of the height of the UBCS
signal on the telomeres of a chimp chromosome, and the height of UBCS on one of
the telomeres of a corresponding human chromosome, to predict the height of UBCS at
the other human telomere. Thus, the expected height of the missing telomeres if there
had been no fusion in the human line could be predicted. Using the set of
chromosomes with substitution data for their entire length for both humans and
chimps (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 17, 20), a standard deviation could be
calculated. Thus, the fusion date is calculated from the ratio of the actual UBCS height of
the current human chromosome 2 fusion peak over the summed predicted UBCS heights of the
hypothetical human telomeres 2a arm q and 2b arm p, multiplied by the estimated 6MY
since the human and chimp lines diverged (Eq. 1). A 95% confidence interval
is achieved by applying +/- 1.96 times the standard deviation of the predictive ability
of the relative telomere heights, to the predictions of the hypothetical human
telomeres. One important part of this calculation involves the size of a region used to
measure the UBCS signal at a telomere. A size of 17mbp was used as the smallest
window where the next successive window’s UBCS signal falls below average for the
genome. Thus, 17mbp at the telomeres of each autosome represents the average
region of heightened UBCS at telomeres in humans. This is roughly 25% of the
human genome.
hg17.chr2.p = UBCS signal of hg17.chr2:1-17000000
hg17.chr2.fus = UBCS signal of hg17.chr2:97000001-131000000
hg17.chr2a.q.hyp = hg17.chr2.p * pt2.chr2a.q / pt2.chr2a.p
hg17.chr2b.p.hyp = hg17.chr2.q * pt2.chr2b.p / pt2.chr2b.q
FusionRatio = hg17.chr2.fus / (hg17.chr2a.q.hyp + hg17.chr2b.p.hyp)
FusionDate = 6MYA + (FusionRatio * 6MY)        (Eq. 1)
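The arithmetic of Eq. 1 can be followed step by step in a short sketch. The argument names are shorthand introduced here: h_p, h_q and h_fus stand for hg17.chr2.p, hg17.chr2.q and hg17.chr2.fus, and the c* arguments stand for the pt2 telomere signals. Note that Eq. 1 uses hg17.chr2.q without defining it, so its measurement is left to the caller. Reading "6MYA" as the point -6.0 on a time axis in millions of years, with the present at 0, is an assumption of this sketch.

```python
def fusion_estimate(h_p, h_q, h_fus, c2a_p, c2a_q, c2b_p, c2b_q,
                    divergence_my=6.0):
    """Follow Eq. 1 using UBCS signal heights.

    Returns (fusion_ratio, signed_time). signed_time is in millions of
    years relative to the present, negative meaning the past (assumed
    reading of "6MYA + FusionRatio * 6MY").
    """
    h2a_q_hyp = h_p * c2a_q / c2a_p   # hypothetical human 2a q telomere
    h2b_p_hyp = h_q * c2b_p / c2b_q   # hypothetical human 2b p telomere
    ratio = h_fus / (h2a_q_hyp + h2b_p_hyp)
    return ratio, -divergence_my + ratio * divergence_my

# Made-up signal heights, chosen only to make the arithmetic easy to follow.
ratio, when = fusion_estimate(10, 20, 15, 5, 10, 4, 8)
```

With these made-up inputs the hypothetical telomere heights are 20 and 10, the ratio is 0.5, and the signed time is -3.0, that is, halfway between the divergence and the present.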
3.0 Results
From the examination of almost 11 million substitutions in the human genome, it can
be seen from Table 1 that nearly as many weak bases underwent substitution as strong
bases. Further, it is clear that weak to strong substitutions merely balance out strong
to weak ones (43.10% vs. 42.78%). However, when substitutions are examined as a
function of other factors, evidence of bias emerges.
Substitution Totals for Human Genome (hg17)

Point Differences between Humans and Chimps      28,896,677
Substitutions with Rhesus macaque outlier        24,817,827   (85.88%)
Substitutions with unambiguous outlier           21,405,843   (86.25%)
Substitutions found in Human Line                10,871,681   (50.79%)
Ancestral Weak Bases Substituted                  5,351,332   (49.22%)
Ancestral Strong Bases Substituted                5,520,349   (50.78%)
Weak to Strong Substitutions in Human Line        4,685,494   (43.10%)
Strong to Weak Substitutions in Human Line        4,650,554   (42.78%)
Weak to Weak Substitutions in Human Line            665,838   ( 6.12%)
Strong to Strong Substitutions in Human Line        869,795   ( 8.00%)

Table 1. While about 11 million substitutions derived in humans were identifiable overall, there is no evidence of bias in this aggregate view.
3.1 Bias as a Social Disease
In a quest to find hidden relationships between weak to strong substitutions and other
factors, the entire human genome was analyzed by windows. Initial analysis used
windows of 300bp stepping 150, based upon the approximate mean size of a gap
subject to BGC due to a recombination event[28]; while more recent analysis tightened
the windows to 100bp stepping 50, based upon results to be described later. The most
obvious evidence of bias appears when fixed substitutions are clustered together. In
Figure 2 it can be seen that the empirical probability of weak to strong substitution
rises once a certain substitution density is reached.
Figure 2. The Empirical Probability of Bias Due to Substitution Count. In windows of 300bp, little bias is seen for the first 5 substitutions, but pronounced bias is found between 7 and 16 localized substitutions. In windows of 100bp, the relationship is even more pronounced and is obvious in clusters of 5 to 11 substitutions. In windows of 300bp, the strongest bias was at 13 substitutions per window, with the proportion of weak to strong being about 46.4% while strong to weak was 37.2%. However, using 8 substitutions as a comparable data point for windows of 100bp, weak to strong substitutions were 49.2% while strong to weak were reduced to 35.9%. Here bias is measured as a simple proportion of weak to strong (red) relative to the three other possibilities (error bars +/- 1 SE). [Available at http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100.html]
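The window tally behind a plot such as Figure 2 might be sketched as follows. This is an illustrative Python outline under stated assumptions, not the code used for this work: the function names and input layout are hypothetical, bases are assumed to be uppercase A, C, G or T, and the brute-force scan is written for clarity rather than genome-scale speed.

```python
from collections import defaultdict

WEAK = set("AT")  # weak bases; G and C are treated as strong

def classify(ancestral, derived):
    """Label a substitution as 'WS', 'SW', 'WW' or 'SS'."""
    a = "W" if ancestral in WEAK else "S"
    d = "W" if derived in WEAK else "S"
    return a + d

def ws_proportion_by_count(subs, width=100, step=50):
    """subs: list of (position, ancestral_base, derived_base), 0-based positions.

    Returns {substitutions_per_window: proportion that are weak to strong},
    pooling all windows that hold the same number of substitutions.
    """
    if not subs:
        return {}
    totals = defaultdict(lambda: [0, 0])  # count -> [WS substitutions, all substitutions]
    last = max(pos for pos, _, _ in subs)
    for start in range(0, last + 1, step):
        kinds = [classify(a, d) for pos, a, d in subs
                 if start <= pos < start + width]
        if kinds:
            totals[len(kinds)][0] += kinds.count("WS")
            totals[len(kinds)][1] += len(kinds)
    return {n: ws / total for n, (ws, total) in totals.items()}
```

Because the windows overlap by half their width, a substitution is normally counted in two windows, which is a property of the stepping scheme rather than of this sketch.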
3.1.1 Documenting Gang Behavior
In order to illuminate the full dimensions of clustering associated with biased
substitutions, “nearest neighbor” methods were used. This allows examining the
proportion of weak to strong substitutions in clusters of 2 through 10 substitutions. In
Figure 3, the relationship between weak to strong substitutions and clustering can be
clearly seen when clusters of 5 substitutions are within 100 base pairs. These plots
taken together as a series were dubbed “zipper plots” for obvious reasons.
Figure 3. “Zipper Plots”: Bias for Clusters of N Substitutions. No discernible bias is revealed when 2 substitutions are as close as 20 base pairs. The first hint of a relationship between weak to strong bias and clustering isn’t seen until 4 substitutions are tightly clustered. When a cluster of 5 substitutions falls within 100 base pairs, bias is clearly seen. By clusters of 10, the proportion of weak to strong substitutions exceeds strong to weak ones for most of the range examined. [Available: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#Zip]
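The raw points behind a zipper plot could be gathered with a sketch like the one below. It assumes, for illustration only, that a cluster of N is any run of N consecutive substitutions along the chromosome and that its width is measured from the first to the last member; the function name and input layout are hypothetical.

```python
def cluster_spans(subs, n):
    """subs: list of (position, kind) sorted by position,
    where kind is 'WS', 'SW', 'WW' or 'SS'.

    Yields (span_in_bp, ws_fraction) for every run of n consecutive
    substitutions.
    """
    for i in range(len(subs) - n + 1):
        run = subs[i:i + n]
        span = run[-1][0] - run[0][0] + 1   # inclusive width of the run
        ws = sum(1 for _, kind in run if kind == "WS")
        yield span, ws / n
```

Binning the yielded spans (for example in steps of 20bp) and averaging the weak to strong fraction within each bin would give one curve per cluster size N.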
The same method was turned on its head to reveal, not the number in a cluster that
results in bias, but the width of a cluster required to see bias. In Figure 4, bias is
barely visible when clusters of 9 substitutions are spread across 300 base pairs, but is
striking when clusters of 5 or more are within 100 base pairs. Figure 5 condenses the
genome wide analysis of the dimensions of clustering which shows evidence of bias.
It is clear from this image that the space in which the majority of substitutions occur,
shows no special bias for weak or strong substitutions. But as this sea of substitutions
reaches the rocky edges where high density substitutions occur, bias is noticeably
tipped in favor of weak to strong substitutions. Curiously at some of the most
extreme densities of substitutions, bias is actually strong to weak. However, this
work does not attempt to characterize that phenomenon.
Figure 4. Bias for Substitutions within N bases. When clusters of substitutions are within 300 bases, bias is not seen until 9 substitutions are clustered together. When cluster spread is restricted to within 200 bases, bias occurs in clusters of 7, and when within 100 bases, clusters of 5 or more are clearly biased. The strongest bias observed in the human line, genome wide, is for clusters of 10 substitutions within 80 base pairs. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#WinZip]
Figure 5. Cluster Bias Heat Map. The results of the three dimensional analysis of bias as a function of length and number of substitutions in a cluster can be summarized by this heat map. Bias is characterized in the range of 2-10 substitutions (X axis) by 20-600bp spans (Y axis). While the bulk of the range in which substitutions fall shows no tendency for weak to strong substitutions to dominate, the rocky shore is strongly biased to weak to strong substitutions. This heat map has been normalized to put yellow at the genome mean and red and blue at the extremes. [http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#WinZip]
3.1.2 Biased Groups are Recruited from Unbiased Individuals
While the clustering bias seen here is consistent with the Biased Gene Conversion
model, it is conceivable that it is due to other factors. Certainly mutations might
occur in clusters and if the process that gives rise to clustered mutations also biases
them, then this pattern might be expected. Such a process should reveal itself in
biased clustering of Single Nucleotide Polymorphisms. The results, shown in Figure
6, reveal little evidence of bias in SNPs. If biased clusters were due to an underlying
mutation bias, one would expect the levels of bias to be stronger in SNPs than
substitutions, where purifying selection would reduce their numbers. The evidence
instead points to selection of biased clusters of SNPs. Assuming that the forces
resulting in mutations in humans have not altered significantly over the past 6 million
years, the process that results in biased clusters of substitutions is not due to an
underlying mutation bias.
The selection of biased SNPs can even be seen in the coarsest view of the data.
While weak to strong SNPs make up 39.97% of total SNPs, weak to strong
substitutions are 43.1% of all substitutions. There are fully 10% more strong to weak
SNPs than there are weak to strong. But by the time SNPs have been fixed in the
genome as substitutions, the two totals roughly balance. Whatever is selecting weak
to strong SNPs is having the satisfying effect of counterbalancing the underlying
strong to weak mutation bias, at least in recent human evolution. While it is hard to
imagine that the motive force acting upon individual SNPs is selection to maintain the
nucleotide balance genome wide, the result is symmetry nonetheless.
Figure 6. Weak to Strong Bias in Single Nucleotide Polymorphisms. Unlike the bias seen in clusters of fixed substitutions, clusters of SNPs (which are not yet fixed) show little tendency to be biased due to clustering in windows of 300bp or 100bp. Additionally the genome wide zipper plot shows little evidence. However, examination of the heat map reveals small pockets of clustered bias. This could be expected in a BGC model, as clusters move towards fixation at recombination hotspots. [http://www.cse.ucsc.edu/research/compbio/ubcs/snp17_w100.html]
3.2 Focusing on Bias through the Window Lens
While the association of weak to strong bias with clustered substitutions is clear, other
relationships need investigation. It is entirely conceivable that the same process
which selects biased clusters also selects individual weak to strong SNPs; and these
individual biased substitutions have so far gone undetected as having a unique origin,
in the background of unbiased substitutions. For this reason, additional analysis using
the window method may be revealing.
3.2.1 Conservative Bias?
Natural selection might select a biased mutation, or a cluster of biased mutations,
especially if they represent significant changes in a gene. This might be revealed if
bias is associated with conservation. However, no such evidence is found in Figure 7,
which shows weak to strong bias mildly retreats in the face of rising conservation
score. While it might be expected that individual clusters of substitutions might be
strongly selected (for or against), this analysis reveals little tendency for clusters of
biased substitutions to be more favored in conserved regions as compared to
unconserved regions of the genome. If anything, clusters and biased clusters are less
often found in conserved regions.
Figure 7. Bias as a Function of Conservation Score. While most windows had conservation scores of less than 0.2 (little conservation), there were still 173,677 windows of 100 bases with at least one substitution which had a conservation score of 0.8 or greater. However the most weak to strong bias is seen in windows with a conservation score of 0.2 to 0.4. The average conservation score falls (slightly) as the substitution count rises, and the P Score (binomial probability which reflects greater bias at lower numbers) shows a very weak correlation with conservation (R 0.016), which would translate to a negative correlation between bias and conservation. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]
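A P score of the kind referred to in the caption above can be illustrated with a short Python sketch. The exact definition used in this work is given in the methods; the version here is an assumption made for illustration: an upper-tail binomial probability, with the genome wide weak to strong proportion from Table 1 (43.10%) as the default per-substitution probability. The function and parameter names are hypothetical.

```python
from math import comb

def p_score(n_subs, n_ws, p_ws=0.431):
    """Upper-tail binomial probability of seeing at least n_ws weak to strong
    substitutions among n_subs, if each substitution is weak to strong with
    probability p_ws. Lower values indicate a more biased window.
    """
    return sum(comb(n_subs, k) * p_ws ** k * (1 - p_ws) ** (n_subs - k)
               for k in range(n_ws, n_subs + 1))
```

Under this assumed definition a window with a single weak to strong substitution can never score below p_ws, so only windows holding several substitutions can reach the extreme P scores.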
3.2.2 Do the Strong Convert the Weak?
Since this genome wide effort finds little evidence for a mutation bias or for
conservation leading the selection of biased clusters, the question of isochores again
bubbles to the surface. Is it possible that areas of greater frequency of strong base
pairs propagate or at least maintain themselves by favoring the conversion of the
weak to the strong? This should be seen by an analysis of bias relative to the G+C
content of the surrounding region. If such a relationship existed, it might provide insight
into the forces shaping the much larger isochores.
Figure 8. Empirical Bias as a function of G+C Content. Windows of 100bp are analyzed, but the G+C content is taken from a window of 1000bp with the window of 100 at its center. At left, it is not surprising to find that as G+C rises, the proportion of weak to strong substitutions falls. This can be understood when it is recognized that there are fewer and fewer weak bases available to be substituted. The graph on the right compensates for this trend by using a conditional probability. It shows the empirical probability of bias, given the ancestral G+C content of the region. In this view, it is clear that the G+C content of a region does affect the proportion of substitutions which are biased. However, the effect tends to weaken bias, rather than strengthen it. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100.html]
But no clear relationship is seen in Figure 8, where bias as a function of surrounding
G+C content in windows of 100bp is examined. Since any effect of nearby G+C
might extend well beyond the windows of 100bp examined, the G+C content reflects
a window of 1000bp with the target substitution window at its center. If G+C content
is large, it is not surprising that weak to strong bias diminishes, since fewer weak
bases are available to be converted to strong. That is why the second plot of Figure 8
is included. Here the empirical probability of bias given the ancestral G+C content is
measured. The conditional probability would result in a horizontal line for the null
model of random substitutions. The proportions of weak to strong and strong to weak
are clearly affected by the G+C content of the surrounding region, but for the most
part the decay of high (or low) G+C areas would be expected. In fact, this graph
suggests that G+C extremes are not only moving back to the middle ground, but are
doing so faster than expected by the null model. Clearly there is no evidence for
isochores growing or even maintaining their current strength here.
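One simple way to compensate for the number of weak and strong bases available to be substituted is sketched below. It is an assumption offered for illustration, not necessarily the conditional probability plotted in Figure 8: each share of substitutions is divided by the share of ancestral bases that could have produced it. The function and argument names are hypothetical.

```python
def availability_adjusted(ws_count, sw_count, total_subs, gc_fraction):
    """Scale the weak to strong and strong to weak shares of substitutions by
    the share of ancestral bases that could have produced them.

    gc_fraction is the ancestral G+C fraction of the surrounding region.
    Returns (adjusted weak to strong, adjusted strong to weak).
    """
    if not 0.0 < gc_fraction < 1.0:
        raise ValueError("gc_fraction must be strictly between 0 and 1")
    ws_share = ws_count / total_subs
    sw_share = sw_count / total_subs
    # Weak bases make up (1 - G+C) of the region; strong bases make up G+C.
    return ws_share / (1.0 - gc_fraction), sw_share / gc_fraction
```

If every base were equally likely to be substituted regardless of its surroundings, both adjusted values would stay flat as G+C changes, which corresponds to the horizontal line of the null model.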
But the picture painted by G+C content is not so simple. In Figure 9, it is seen that
windows with more substitutions are found in slightly higher G+C regions. Still,
though clusters may be more likely, biased clusters of substitutions (as measured by P
score) are not more likely to be found in G+C rich areas. The negative correlation of
weak to strong substitutions to G+C (R -0.209) is comparable to the positive
correlation between G+C and the P score of a window (R 0.225). Both are dominated
by the inherent bias in substitutions which arises from the percentage of A+T
available to be substituted.
Figure 9. G+C Content Affects Clusters of Substitutions. In Figure 8 there was no evidence that G+C content alone biases substitutions. However, clusters of substitutions are clearly but not dramatically more prevalent in regions of higher G+C content, as seen in the left. There is a positive correlation (R 0.225) between bias as measured by P score (lower means more biased) and G+C content, which means that lower G+C is associated with lower P scores. Again, any relationship is obscured by the inherent substitution bias due to G+C content. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]
Having established that higher G+C content reduces the bias of substitutions and is
slightly more favorable to clusters of substitutions, it seems appropriate to revisit the
plot found in Figure 2, which shows the empirical probability of bias due to
substitution count. Using the conditional probability that eliminates the inherent
substitution bias due to G+C, it can be seen in Figure 10 that bias still occurs in
clusters, and not simply due to low (or high) G+C. However, the two additional plots
covering low and high G+C make it clear that high G+C areas are more dramatically
biasing of clusters than are low G+C regions. More specifically, regions of higher
G+C are not more likely to result in bias and are only slightly more likely to result in
clusters, but are clearly more likely to result in clusters that are biased. Thus G+C
content is related to biasing of clusters, but cannot be the only force involved, since
biased clusters occur in both high and low G+C regions. Again, the message is that
the forces that give rise to clusters of substitutions are acting in a different
environment than those that result in sparse substitutions.
Figure 10. The Conditional Empirical Probability of Bias by Substitution Count. Re-plotting Figure 2 using the conditional probability eliminates the bias that is due purely to the G+C content of a region (upper left). The result is that clusters are still more biased than sparse substitutions. The fact that the blue “strong to weak” curve dominates most of the graph is not unexpected, since it is well established that strong to weak mutation rates dominate weak to strong. However, it can be seen that the majority of windows have less than 50% G+C content (mean 0.410), which results in weak to strong and strong to weak substitutions evening out genome wide (43.10% vs. 42.78%). Further evidence of the effects of G+C on clusters found in Figure 9 can be seen in the two lower plots. At low G+C (G+C 20-30%; N=884,244 windows) clusters are only marginally more biased than individual substitutions. However, in regions enriched for G+C (G+C 50-60%; N=1,482,460 windows), clusters are clearly more biased than lone substitutions. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100.html]
3.2.3 Bias at the Hot Spots and on the Edge of Town
There is no underlying mutation bias that explains the patterns of biased clusters of
substitutions seen. Also there is no evidence that biased clusters are the result of the
selective pressures that have shaped highly conserved regions. Finally, the G+C
content of regions may be associated with biasing of clustered substitutions, but it
clearly does not tell the whole story. The most likely explanation for biased clusters
remains Biased Gene Conversion. If this were the cause, then the patterns of bias
should be the effect of recombination events, and therefore ought to be associated
with recombination hot spots. While recombination hot spots have been mapped in
humans, it should be remembered that hot spots are known to move.[23, 24] The
recombination hot spots that might have given rise to biased clusters of substitutions
may be long gone by now.
As can be seen in Figure 11, clusters are more distant from current hot spots than the
average for all substitutions (substitutions: 71131.6 bp; clusters: 72399.9 bp).
However, though biased clusters tend to be closer than average (61391.8 bp), there is
no correlation between windowed P score and distance to the nearest hot spot.
Looking at SNPs, it is clear that they are closer, on average, to a hot spot than are
substitutions (60203.0 bp). But clusters of SNPs and biased clusters of SNPs are
closer still (clusters of SNPs: 47806.9 bp; biased clusters: 49394.3 bp). Either
recombination hot spots are triggered by clusters of mutations, or recombination can
create mutations[45, 46] and hot spots can create clusters of mutations. As recombination events
occur, clusters become biased, and hot spots move on or disappear. While this might
explain the hot spot evidence seen here, there is not enough evidence to paint a clear
picture.
Figure 11. Hot Spots are Slightly More Biased. Using the conditional probability, it is clear substitutions found in current recombination hot spots are very slightly more biased than substitutions in colder regions. This result is significant, with more than 1.5 million windows falling in hot spots (included error bars are too small to see). The plot at right shows that clusters of substitutions are actually more distant from current hot spots than the average substitution. But that doesn’t mean that biased clusters are more distant. In fact, there is no correlation between the P score of windows and the distance to the nearest hot spot. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]
In contrast to locations near a hot spot, there is clear evidence that location near a
telomere results both in more biased substitutions and in more biased clusters of
substitutions, at least for the less extreme P scores. Figure 12 shows the evidence
that led to the next round of investigation. Where are biased clusters of substitutions
located on chromosomes?
Figure 12. Bias at Sub-telomeric Regions. Using the conditional probability at left, it is clear that sub-telomeric regions are more biased, and this effect is not due to G+C content. Using P scores, it is clear that biased clusters are closer, on average, to telomeres than are sparse or unbiased windows. In this plot, the proximity to a telomere is represented as a scale from 0 to 1, with 1 being at the very tip of a chromosome. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]
3.3 Geographic Distribution of Biased Groups
The window method allows for detecting clusters of substitutions, and therefore
proved useful in illuminating the relationship of bias to clusters. However, the
method inherently under-counts clusters. Therefore another method is used here to
try to quantify biased clusters and map them. As explained in methods, by defining a
minimum requirement for a biased cluster, it is possible to determine, for every
substitution, whether it exists in a biased cluster. For this work, a biased cluster is
defined as at least 5 substitutions within 300bp, and at least 80% weak to strong. By
examining the four nearest neighbors of each substitution, a set of substitutions that
falls within this definition was developed. In Figure 13, this set of “biased clustered
substitutions” was mapped across one chromosome. Chromosome 18 is displayed
here because, while not atypical of most of the chromosomes, its coverage is good
and it is small enough that the details of the plots should be visible. Additionally, in
the zipper plots, chromosome 18 revealed some of the strongest cluster bias seen in
this research. In the figure, substitutions, weak to strong substitutions and biased
clustered substitutions are fairly uniformly distributed across the chromosome, and
appear to vary proportionally. However, clustering is both higher than expected by
chance and stronger near telomeres. Indeed, closer examination of biased clustered
substitutions reveals that they, too, are more frequent at the telomeres. But it is when
the frequency of “unexpected biased clustered substitutions” (UBCS) is examined
that two things emerge. First, the strength of this bias is mild through most of the
chromosome. Second, the accumulation of biased clustered substitutions sharply
increases near telomeres.
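The biased cluster definition given earlier in this section (at least 5 substitutions within 300bp, at least 80% weak to strong) can be sketched in Python as follows. This is an illustrative reading rather than the code used here: it assumes that a substitution and its four nearest neighbors always form a run of 5 consecutive substitutions, and it marks every member of any qualifying run. The function name and input layout are hypothetical.

```python
def biased_cluster_members(subs, size=5, max_span=300, min_ws=0.8):
    """subs: list of (position, kind) sorted by position,
    where kind is 'WS', 'SW', 'WW' or 'SS'.

    Returns a list of booleans, True where the substitution belongs to at
    least one run of `size` consecutive substitutions that spans no more
    than `max_span` bp and is at least `min_ws` weak to strong.
    """
    flags = [False] * len(subs)
    for i in range(len(subs) - size + 1):
        run = subs[i:i + size]
        span = run[-1][0] - run[0][0] + 1   # inclusive width of the run
        ws = sum(1 for _, kind in run if kind == "WS")
        if span <= max_span and ws / size >= min_ws:
            for j in range(i, i + size):
                flags[j] = True
    return flags
```

Counting the True entries per 1mbp bin would give the number of biased clustered substitutions mapped along a chromosome.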
Figure 13. Mapping Chromosome 18. The histogram at the upper left shows that substitutions (green) have been mapped throughout all of chromosome 18 (except in the highly repetitive region around the centromere). While different regions may vary greatly (each bar is 1mbp), most regions contain between 3500 and 5000 substitutions. Likewise, weak to strong substitutions (gold) and “biased clustered substitutions” (red) are fairly uniformly distributed and follow in rough proportion with the overall substitution frequency. The distribution of clusters, on the other hand (top right, gray), begins to diverge from this pattern, and is more pronounced near the telomeres. The black line in this graph represents the expected frequency of clusters (Poisson, given the frequency of substitutions in a bin). Clusters are much more frequent than chance would predict. In the lower left, biased clusters are also more telomeric (at least for one of the telomeres). The black line in this plot is the expected number of biased clustered substitutions (binomial, based upon both the number of clustered substitutions and the frequency of weak to strong substitutions in each bin). Finally, in the lower right plot, we see unexpected (actual minus expected) biased clustered substitutions (UBCS). The null model would predict a line at zero, yet the biased clustering is occurring beyond what can be expected by chance, across most of the chromosome, and is especially heightened at the telomeres. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr18.html]
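The "actual minus expected" calculation for a single bin might look like the sketch below. The binomial expectation is an assumed form chosen for illustration: the chance that at least 4 of 5 randomly drawn substitutions are weak to strong, multiplied by the number of clustered substitutions in the bin. The precise expectation used in this work is described in the methods, and the function names are hypothetical.

```python
from math import comb

def expected_biased(n_clustered, p_ws, size=5, min_ws_count=4):
    """Expected number of clustered substitutions that fall in a biased
    cluster if each substitution were weak to strong with probability p_ws,
    independently of its neighbors (an illustrative null model).
    """
    tail = sum(comb(size, k) * p_ws ** k * (1 - p_ws) ** (size - k)
               for k in range(min_ws_count, size + 1))
    return n_clustered * tail

def ubcs(actual_biased, n_clustered, p_ws):
    """Unexpected biased clustered substitutions: actual minus expected."""
    return actual_biased - expected_biased(n_clustered, p_ws)
```

A bin whose biased clusters arise purely by chance would give a value near zero, matching the line at zero predicted by the null model.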
Figure 14. “Winged Maps”: Unexpected Bias is Predictable. In chromosome after chromosome, the mapping of UBCS reveals the same pattern, mild through the length of the chromosome and elevated at the telomeres. Here, chromosomes 1, 10 and 20 are shown as examples. While the chromosome sizes vary, the pattern does not vary significantly. The top three images show the raw unexpected biased clustered substitutions (all bars represent 1mbp). The lower graphs apply a smoothing function to the raw signal (least squares using a local window of 25mbp). The smoothed maps use the same coordinate dimensions which helps illustrate that the unexpected clustered bias rises at the telomeres, not at distance from the centromere. The yellow region about the smoothed curve is the 95% confidence interval based upon the 25mbp window. When it is above the zero line, the null hypothesis of random variation is rejected. For these three chromosomes, the null hypothesis is rejected through 61.8%, 79.4% and 68.3% of their lengths respectively, while 50.9% of the whole genome shows elevated bias with 95% confidence. The box around each smoothed curve is the confidence interval for the whole chromosome. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]
3.3.1 Near Universal Pattern of Bias Leaves Evidence of a Fusion
In fact, the maps of unexpected bias for almost all chromosomes follow this pattern,
and create a visual of “winged” maps. In Figure 14, three chromosomes, side by side,
show the same pattern, even though the lengths of the chromosomes vary. The
granularity of the maps (each bar represents 1 million bases) is kept the same in order
to compare the magnitudes. Figure 14 also includes the smoothed curves for the
same three chromosomes, this time plotted on identical coordinate dimensions.
Clearly the elevated clustered bias is not a function of distance from the centromere,
but a property of the telomeres. Also, clearly, the null hypothesis of random
fluctuation cannot be accepted for half of the genome.
That the pattern of elevated bias at the telomeres is so regular leads to a closer
examination of the exceptions, which are seen in Figure 15. Chromosome 2 has an
internal peak, which is easily understood as the remnants of the fusion of two
ancestral chromosomes in the human line.[47] It is certainly possible to imagine that
there is some force associated with telomeres and causing biased clusters of
substitutions, and that that force is still acting in the middle of chromosome 2 even
though it is no longer telomeric. However, a simpler explanation would be that the
peak we see in the middle of chromosome 2 represents substitutions that occurred
before the fusion of the two ancestral chromosomes. If this were so, then the pattern
here would be compatible with the fusion occurring relatively recently in human
evolution. While some have suggested that this fusion might have been the speciation
event that separated the human and chimp line[48], one interpretation of the UBCS
results suggests that this can’t be so.
Figure 15. Exceptions to the Pattern of Unexpected Biased Substitutions. The predictability of the pattern of where UBCS is located on chromosomes suggests that the exceptions reveal part of the story. In the upper left, all chromosomes are mapped end to end, and the smoothed signal peaks almost always fall upon the borders of chromosomes. However, chr2 (upper right) has a peak in the middle which clearly violates this pattern. It reveals the merger of two ancestral chromosomes which happened on the human line. Most positions on human chromosome 2 to the left of the peak, align with chimpanzee chromosome 2a, while positions to the right align with 2b. The human sex chromosomes also stand out as exceptions. Chromosome Y (lower right), which suffers from a lack of data through much of its length, shows no signal beyond the null model expectation. Chromosome X (lower left) also shows a greatly reduced signal, with only one telomere appearing to deviate from random. These plots on uniform dimensions show the smoothed curve of the raw unexpected biased clustered substitutions (red: least-squares over 25mbp windows), the 95% confidence interval for a sliding window of 25mbp (yellow) and the 95% confidence interval for the entire region (rectangle). The Pseudo-Autosomal Regions (PAR: blue ) have been marked for chromosome X. [http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]
3.3.2 Predictable Males and Enigmatic Females
The human sex chromosomes also break the mold. In Figure 15, the patterns of
unexpected biased clustered substitutions for chromosome Y show little evidence
beyond random. So far, all the evidence uncovered has been consistent with Biased
Gene Conversion as the cause of the clustered bias documented here. This model
involves recombination events which create biased clusters out of unbiased
polymorphisms. But recombination is not expected to happen on the Y chromosome,
since it is at all times haploid. Therefore, the lack of a signal on chromosome Y is
consistent with the BGC model. However, the X chromosome also shows little
UBCS signal, which is not predicted by the BGC model. One would expect half the
signal on X, given that it is only diploid in females, and would therefore have the
opportunity to recombine only half as often as do autosomes. But the actual UBCS
signal on X is much less than half.
There is a loophole in the haploid-diploid contract of the sex chromosomes: the
pseudo-autosomal regions (PAR) of X and Y. These two short regions (2.6 and
0.9mbp) at the tips of both chromosomes align between X and Y and are known to
recombine. It is only in this region of Y that the UBCS signal is above zero.
Similarly the signal on X is strongest near the PAR regions of the X chromosome, as
seen in Figure 15. However, the strength of biased clustering is so diminished on X
that the question must remain open: is there an “X exception” that is not explained by
the BGC model?
Figure 16. Zipper Plots of Four Chromosomes. Recalling the zipper plots for the whole genome seen in Figure 3, the examination of 4 individual chromosomes is revealing. While most chromosomes show a degree of bias for clusters closer to the genome wide bias, some chromosomes, such as 18 show dramatic bias (upper left). Chromosome 2, not shown here, also shows dramatic bias. Early analysis of chromosome 19 showed little bias, which proved to be due to high G+C content with its inherent trend of strong to weak substitutions. However, this zipper plot (upper right) clearly shows evidence of the same weak to strong cluster bias seen in all other autosomes. The lack of a signal in the zipper plots for chromosome Y (lower right) is not surprising and is consistent with the BGC model. However, the “X exception” of little or no signal seen in the maps of UBCS is confirmed here. Clearly chromosome X has little or no detectible signal (lower left), and is an exception to the consistent picture of bias seen throughout the rest of the human genome. [All plots available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]
By returning to the so-called zipper plots, the picture of chromosome X is brought
further into focus. In Figure 16, the tendency for clusters of ten substitutions to be
biased is examined for four individual chromosomes. The plots of all autosomes
show evidence of biased clusters, to varying degrees. Both chromosome 2 (not
shown) and 18 show dramatic bias, whereas the amount of bias detected on
chromosome 19 is smaller than the genome wide average. However, chromosome 19
is noted for having the highest G+C content of any chromosome (47.3% compared to
40.1% genome wide) which made early attempts at finding evidence of bias
unsuccessful. The zipper plots, on the other hand, show that chromosome 19 has been
under the same biasing pressures as the other autosomes. No evidence of biased
clusters of substitutions is detectible on chromosome Y in either Figure 16 or the
corresponding heat maps of Figure 17. The heat map for chromosome Y actually
appears to show a reverse bias, compared to the autosomes. This trend is provocative
in its own right but is not addressed in this work. The picture of Y revealed from
these two graphics fits the predictions of the BGC model. However, once again,
chromosome X is an enigma. The zipper plots (for clusters of 10 shown in Figure
16), which are sensitive enough to see bias on chr19, show no such bias on X.
Examination of the heat map for X in Figure 17 confirms the lack of a biased cluster
signal and similarly seems to show signs of reverse bias. Possibly chromosome X is
being molded by competing pressures. It is clear from the lack of wings in Figure 15,
the tangled zipper in Figure 16 and the reverse bias seen in the heat map of Figure 17,
that chromosome X is an enigma. While BGC may yet prove to be the cause of the
biased clusters seen in this research, any thorough explanation will have to explain
the X exception.
Figure 17. Heat Maps of Bias for Four Chromosomes. As seen in Figure 16, both autosomes 18 and 19 show evidence of biased clusters. For chromosome 18 (upper left), where other evidence is dramatic, the bright red in the lower corner of the graph indicates the most biased substitutions are found when six or more substitutions fall within 100bp. The same pattern is found in chromosome 19 (upper right), where the strength of bias is reduced, but the association with clustering is clear. The pattern is reversed for chromosome Y (lower right), a phenomenon not addressed by this paper. Once again, chromosome X (lower left) shows little evidence for the clustered bias seen on autosomes, and is more in line with Y. These heat maps have been “normalized” to place yellow at the average bias for the chromosome, while red and blue are maximized for the chromosome extremes. [All plots available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]
3.3.3 Are Humans More Biased than Chimpanzees?
Having used the alignment of the human, chimp and macaque genomes as a tool to
examine substitutions in humans, it is only natural to ask if the same analysis can be
done for the other two species. However, the quality of the human genome assembly
(on its eighteenth release) far exceeds the assemblies of chimpanzee (second release)
and Rhesus macaque (first release). While there can be some confidence that a single
nucleotide difference found in humans but not in chimps or macaques is indeed a
substitution in humans, the reverse case may actually be a sequencing error in either
chimp or macaque. Also, while there is no reason to expect that the false positives of
sequencing errors will be biased towards weak or strong differences, it is believable
that false positives will show up more frequently in clusters. In addition, the known
SNPs in the human genome could be subtracted out of the human substitution
database, purifying the data further. However, the set of possible substitutions for
chimps or macaques cannot currently be cleaned up in the same way. Nevertheless,
the ratio of substitutions to false positives in a full genome analysis may allow
detection of biased clusters. Using the most recent release of the three genome
assemblies, substitutions were examined across the chimpanzee genome. As
expected, the substitution count (including false positives) was higher (115% of the
human substitution count) and the clustered substitution count, higher still (138% of
human). While biased clustered substitutions exceeded those found in humans
(123%), unexpected biased clustered substitutions, which should represent those not
produced by chance, were only 59% of what was found in humans. This does not mean
that humans are more biased than chimps, of course, but clearly illustrates the
noisiness of the chimpanzee data.
Figure 18. Biased Clustered Substitutions in the Chimpanzee Genome. Unlike the signal seen for the human genome in the zipper plots (upper left vs. Figure 3) or heat map (upper right vs. Figure 5), there is much less detected bias for clusters of substitutions in the chimpanzee genome. However, there is still a striking telomeric rise in unexpected bias on all autosomes for which data was available. This pattern is entirely in parallel to the UBCS signal seen in humans (lower left and right vs. Figure 14 and Figure 15). While the human plots were generated using May 2004 human (hg17), Nov. 2003 chimp (pt1) and Jan. 2005 macaque (rh0 prerelease) assemblies, the chimpanzee plots were generated using Mar. 2006 human (hg18), Mar. 2006 chimp (pt2) and Jan. 2006 macaque (rh1) assemblies. Sufficient high quality scores for chimp substitutions were unavailable for chr21 and chrY. [http://www.cse.ucsc.edu/research/compbio/ubcs/pt2_MapZip_0.html]
In Figure 18, we can see the evidence of the reduced signal of clustered bias. While
weak to strong bias only shows up in the extreme clusters, it is nevertheless evident
when unexpected bias is mapped across the genome. The overall signal is reduced
for unexpected bias (mean chimp: 11.8; human: 19.9), but what is striking is the
degree to which the shape and amplitude of the curves agree between chimp and
human chromosomes. The phenomenon can be seen in Figure 19, which compares
chimp chromosomes 2a and 2b (artificially fused), to human chromosome 2. The
Pearson’s correlation coefficient for the two sets of smoothed curves that cover most
of the two genomes is R=0.877! Remember that the signals being examined are made
up entirely of substitutions that have occurred since the human and chimpanzee lines
have split, and the datasets do not share a single aligned location between the two
species. While it is easy to suggest from this comparison that the internal peak of
human chr2 is due to the region having been telomeric for much of the last six million
years, it is also clear from the similarity of minor peaks that proximity to a telomere
alone cannot explain the accumulation of biased clusters. Another implication that is
hard to ignore is that the force that has given rise to the bias seen here must have run
a parallel course in our two species, despite the unique population histories we have
had. Finally, using the chimpanzee genome, we can also examine the “X exception”.
In Figure 20, we can see that there is virtually no detectible unexpected cluster bias in
the chimpanzee X chromosome. Thus, in two species, the X chromosome is clearly
much less biased than autosomes.
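The correlation quoted above compares two binned profiles position by position. A minimal sketch of that comparison follows, assuming each profile is a list of per-bin unexpected-bias values on a shared coordinate grid. The function name and the toy numbers are illustrative only and are not thesis data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have the same length (at least 2 bins)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-bin values for two species (made up for illustration).
human_profile = [30.0, 12.0, 8.0, 15.0, 9.0, 28.0]
chimp_profile = [18.0, 7.0, 5.0, 10.0, 4.0, 17.0]
r = pearson_r(human_profile, chimp_profile)
```

Because the coefficient is insensitive to overall scale, two profiles can correlate strongly even when one signal is weaker on average, as is the case for chimp relative to human here.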
Figure 19. UBCS Profile is Similar between Humans and Chimps. While the fact that the chimpanzee chromosomes 2a and 2b both show elevated bias at their telomeres is not surprising, the overall profile of unexpected bias on these two chromosomes is remarkably similar to the fused human chromosome 2. The artificially fused maps of chimp 2a and 2b at left, and the human chr2 at right have a similar set of local minima and maxima, as well as a remarkably similar height to their peaks. There is surprising agreement among many of the autosomes across the two species. [http://www.cse.ucsc.edu/research/compbio/ubcs/cmp_MapUS_All.html]
Figure 20. The “X Exception” in Chimpanzees. While unexpected biased clustered substitutions were detectible in all of the chimpanzee autosomes for which data was available (Figure 18), the X chromosome stands out by showing virtually no unexpected bias (unsmoothed mean: -1.87, genome wide: 11.2; null hypothesis: 0). This pattern is even less ambiguous than what is seen in the human X chromosome (Figure 15), and should stand as confirmation that the force creating biased clustered substitutions is not happening in X. Smoothing was by least squares for 25mbp with 95% confidence in yellow and chimp pseudo-autosomal region in blue. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/pt2_MapZip_chrX.html]
3.3.4 Biased Without a Cause or Are Boys Troublemakers?
At this point the evidence gives a clear picture of where bias is found and how
unexpected that bias is. We are left with a consistent pattern of bias in two species, a
possible cause in BGC, and a single complication to that picture: the X exception. In
order to elucidate any hidden relationships with recombination, selection or G+C rich
isochores, several factors were mapped across each chromosome and the correlations
between the changes in signal strengths and unexpected clustered bias were
measured. In Figure 21, the relationships are examined for chromosome 18, which
has been shown to have a particularly strong signal. While unexpected biased
clustered substitutions do seem to magnify the highs and lows of G+C content, as
might be expected if it were the result of an isochore selection pressure, this
relationship does not hold for the telomere of the short arm at all. The relationship
between biased clusters and recombination hot spots does follow through the
telomeric rise, as the BGC model predicts. However, as can be expected, given the
historical nature of the substitution signal vs. the current events that the
recombination hot spots represent, the relationship is far from lock step. This makes
the result of the sex differentiated recombination rates more startling. For
chromosome 18, the current male recombination rate correlates to the unexpected
biased clustered substitutions signal at R=0.694, while the female recombination rate
only correlates at R=0.160. Of all the correlations measured in this research, the
current male recombination rate shows the clearest relationship with clustering of
biased substitutions!
As can be seen in Figure 22, this relationship holds genome wide. Biased clusters of
substitutions do not correlate with conservation scores at all, and are only mildly
correlated with G+C content. The strongest relationship is with recombination rates
and most especially with male recombination rates. While the recombination rates
were measured as current phenomena[37], an additional word of caution is needed.
The rates were averaged across one million base pair sections of each chromosome.
This granularity agrees with the bins used throughout the chromosome mapping
shown here. However, Figure 22 plots the changes in correlations as the bin sizes
range from ten thousand to one million base pairs (1mbp). While the correlations of
mildly related factors might rise as the bin sizes grow and the number of bins drops
towards the singular; for tightly related factors, a correlation would be expected to
remain constant, or even weaken as the data becomes less granular. Thus, since G+C
was measured in one thousand base pair blocks, if there was a strong relationship
between G+C and biased clusters, that relationship might be expected to fall as the
bin sizes rose from 10kbp to 1mbp. Yet this is not the case, suggesting that any
relationship between G+C and UBCS acts over regions of hundreds of thousands of
bases. However, since the recombination rate data is limited to a granularity of
1mbp, the best correlation possible should be found at 1mbp as is seen.
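The dependence of a correlation on bin size can be illustrated with a short sketch. It assumes two tracks measured on the same fine grid (for example 10kbp bins) that are merged into coarser bins by averaging before the correlation is taken. The function names and the toy tracks are hypothetical and do not reproduce the thesis pipeline.

```python
import math

def rebin(values, factor):
    """Average each run of `factor` consecutive fine bins into one coarse bin.

    A trailing partial run is dropped so every coarse bin covers equal length.
    """
    usable = len(values) - len(values) % factor
    return [sum(values[i:i + factor]) / factor for i in range(0, usable, factor)]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_by_bin_size(signal, factor_track, merge_factors):
    """Map each merge factor to the correlation of the two re-binned tracks."""
    return {
        f: pearson_r(rebin(signal, f), rebin(factor_track, f))
        for f in merge_factors
    }

# Hypothetical fine-grained tracks: the second is the first plus short-period noise.
toy_signal = [(i * 7) % 11 for i in range(100)]
toy_factor = [s + (i * 3) % 5 for i, s in enumerate(toy_signal)]
r_by_size = correlation_by_bin_size(toy_signal, toy_factor, [1, 5, 10])
```

In this toy case the noise averages out inside the larger bins, so the correlation rises with bin size, which is the kind of granularity effect the caution above concerns.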
Figure 21. Correlations of Biased Clustering on Chromosome 18. In order to elucidate the relationships of biased clusters of substitutions, each chromosome was mapped and correlated with different factors. For chromosome 18, which shows strong UBCS, the trend of bias to magnify and exceed G+C extremes can be seen to some degree (orange, upper left). However, the Pearson’s correlation coefficient between G+C content and unexpected biased clustered substitutions is only R=0.257. The relationship with current recombination hot spots (gray) proves closer, with R=0.465. However, it isn’t until mapping of sex differentiated recombination rates that a strong correlation is seen (lower left: smoothed by least squares in 15mbp windows). While the correlation for the smoothed female recombination rate (pink) is R=0.194, the male recombination rate (blue) correlates at R=0.961! This relationship holds up in the unsmoothed data as well (female R=0.160; male R=0.694). These correlations are especially striking when it is realized that only current recombination rates can be measured, while the signal seen in substitutions has accumulated over six million years! [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr18.html]
Figure 22. Correlations of Unexpected Biased Clustered Substitutions Genome Wide. This graph plots the Pearson correlation coefficient R between unexpected biased clustered substitutions and a number of factors for bins of sizes ranging from 10,000bp to 1,000,000bp. There is no correlation between mean conservation score and unexpected bias (green at bottom), or for the number of bases transcribed in a region (purple). Correlation with G+C content is mild (gold in the middle), rising to R=0.306. The sex-averaged recombination rate (black) rises to R=0.410. However, when the recombination rate is broken down between male and female, a striking difference emerges. While the female recombination rate (pink) peaks at R=0.177, the male recombination rate (blue) shows the greatest correlation to unexpected biased clustered substitutions of any factor. The highest correlation is found at R=0.524 for bins of 1 million base pairs. [http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#CorrU]
The strongest correlation, between unexpected clustered bias and male
recombination rates, holds genome wide and is seen on all but one of the autosomes
for which there is complete data. (Chromosome 2 is the exception, which is discussed
below.) However, it is clear that this relationship is largely due to the events that
occur in the telomeres. Indeed, the correlation advantage of male over female
recombination rates is completely wiped out if 18mbp of telomeric ends are removed
from all of the chromosomes. However, the dependence of the correlation on
telomeric location is not enough to discount the importance of this relationship. That
male recombination rates are elevated at the telomeres is well established[49] and a
correlation between BGC and male recombination has also been documented in
human Alu repeats.[50] Most tellingly, a relationship between male recombination
rates and bias would explain the X exception very nicely. Male recombination rates
are a measure of recombination in the creation of male gametes. If biased clusters of
substitutions are the result of biased gene conversion, and this only occurs during
male recombination, then the X chromosome should be largely unaffected by
clustered bias.
3.3.5 Following the Footprints of Past Recombinations
By now the evidence has mounted that biased clusters seen in this research are most
easily explained by the BGC hypothesis. It appears that biased clusters of
substitutions may be thought of as footprints left behind in the genetic record;
footprints which record the recombination events of the past six million years. If this
were true, then it may be possible to use these footprints to date certain events. Just
as fossilized footprints may be dated by what is found above or beneath them, it may
be possible to date genetic events by the BGC footprints that fall above or below
them. This trail leads us back to the fusion of chromosome 2. In Figure 15 and
Figure 19, we saw a peak of bias signal in chromosome 2 which divides it neatly
between ancestral chromosomes 2a and 2b. The simplest explanation is that this
strong signal formed before the chromosomes fused, and that BGC enabled by male
recombination is strongly affected by positions near telomeres. However, an
alternative explanation is that some feature other than telomeric location causes
heightened BGC and the typical “winged” autosomes. In that case, BGC mediated by
male recombination may have still occurred in the region of the fusion after the
chromosomes joined. If this were so, then we may still find heightened male
recombination rates in the neighborhood of the chromosome 2 fusion. In Figure 23, we can see that male
recombination rates are higher at the telomeres of chromosome 2, but are not
currently higher near the fusion point. This is reflected in the lowest correlation (raw
UBCS R=0.285; genome wide: R=0.524) between male recombination and biased
clusters, of all autosomes for which complete data is available. This does not rule out
the possibility that male recombination remained elevated after the fusion but has
cooled off by now. The simpler explanation, though, is that most of the heightened
bias accumulated prior to the fusion when the regions were telomeric. Thus, Occam’s
Razor applied to the UBCS signal and current male recombination rates for
chromosome 2 suggests the fusion is likely to have occurred relatively recently.
Figure 23. Mapping UBCS, G+C Content and Recombination Rates on Chromosome 2. Comparison of the smoothed curves of unexpected clustered bias and both male and female recombination rates is revealing. It is clear that the female recombination rate (pink) shows a mild elevation near telomeres and a steep decline in the last 10mbp from the chromosome tips. The male recombination rate, however, is strongly elevated at both telomeres on this and every other autosome. However, while unexpected biased clustered substitutions are most frequent in the region of fusion of chromosomes 2a and 2b, male recombination rates are not elevated here. While it is possible that this central peak is the result of recombination events that occurred after the fusion, a simpler explanation is that recombination rates changed immediately after the fusion and thus the fusion occurred relatively recently. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr2.html#MapUR]
3.3.6 Seeing Ghosts?
It is possible that other factors besides male recombination are also involved in
producing the UBCS signal. One method to reveal a secondary relationship is to
measure the amount of the UBCS signal that is unexplained by a correlation to male
recombination, and compare that secondary or residual signal to other factors. In
Figure 24, the residual signal is plotted against four other factors. We can see that
female recombination explains virtually none of the UBCS that isn’t already
explained by male recombination. The original weak correlation of female
recombination rate to UBCS appears to have been a ghost of the correlation between
the female rate to male recombination rate. Thus, it seems abundantly clear that the
X exception is due to UBCS accumulation in male but not female gamete production.
Also it is clear that unexpected biased clusters are not a product of positive selection,
as conservation is unrelated to the original UBCS signal (Figure 22) or the residual
signal (Figure 24) after male recombination.
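The residual approach described above can be sketched in a few lines: fit a least-squares line of the signal on the primary factor, take what the line leaves unexplained, and then ask whether a second factor has any slope against that remainder. This is a minimal sketch under the assumption of simple linear fits. The variable names and toy values are hypothetical and are not thesis data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line to a constant predictor")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys):
    """Portion of ys left unexplained by the best-fit line on xs."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical per-bin tracks (made up): the signal depends on both factors.
male_rate = [1.0, 2.0, 3.0, 4.0]
second_factor = [1.0, -1.0, -1.0, 1.0]
ubcs = [2.0 * m + 3.0 * g for m, g in zip(male_rate, second_factor)]

primary_slope, _ = fit_line(male_rate, ubcs)
leftover = residuals(male_rate, ubcs)
secondary_slope, _ = fit_line(second_factor, leftover)
```

A secondary slope near zero would indicate that the second factor explains nothing beyond the primary one, while a clearly nonzero slope points to an additional relationship.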
However, one factor that does play a role beyond male recombination is the G+C
content of a region (Figure 24, upper right). How this influence is effected is hard to
say. It should be understood that male recombination could be the event that causes
UBCS, while the G+C content of the region influences the process of that event, and
the probability that UBCS occurs in a particular instance. Another possible
explanation for this secondary relationship may be that G+C has accumulated over
hundreds of millions of years, while recombination rates represent a snapshot of the
current state of recombinations. The currently measured rates might represent an
incomplete picture of the force that has led to accumulation of biased clustered
substitutions over the last 6 million years. A more complete picture of that force may
involve a slightly different distribution of recombination rates. It is well known that
recombination hot spots change over time[23, 24], and that G+C and recombination
rates are correlated.[19] Additionally, while chimp and human UBCS signals are
essentially parallel, there are differences which may illustrate the degree of
fluctuation of recombination rates over time. It seems likely that the fluctuations in
the last 6 million years occurred in regions of higher G+C. It is also possible that the
reason these regions have higher G+C is due to the rising and falling of male
recombination rates in these regions for hundreds of millions of years. This model
suggests that the secondary relationship of G+C content to UBCS is not a cause, but
an effect! And the correlation between G+C content and residual UBCS may simply
be the ghost of BGC past.
Figure 24. Effects of other Factors beyond Male Recombination Rates. One way to determine if some additional factor beyond male recombination is related to UBCS, is to plot the second variable with the residual signal left unexplained by the first. In this series of plots, the blue line (common to all four), is the best fit between male recombination rate and UBCS genome wide (slope=0.55). However, the distance between the blue line and the actual data points represents the residual, or unexplained portion of each data point. Residuals are plotted here against four other factors. At the upper left, the female recombination rate does not explain any portion of the remaining UBCS signal (pink, slope=0.01). The G+C content (upper right), on the other hand does appear to have some relationship to UBCS, beyond male recombination (orange, slope=0.17). One possible explanation may be the historical accumulation of G+C. Selection, as represented by conservation score (lower left) is also not evident (green, slope=0.03). One might expect selection to be negatively correlated, however selection may affect too few alleles to be noticeable. Finally, transcription density, as measured by the number of bases that fall into a known gene, human mRNA or EST, does not relate to the proportion of UBCS unexplained by male recombination (lower right; purple, slope=0.01). All plots use normalized coordinates. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#ResUMT]
3.4 Humans have been Molded by Bias
If, as the evidence suggests, biased clusters of substitutions are created by
recombination events acting upon clusters of Single Nucleotide Polymorphisms, then
the result should be novel collections of single point mutations which have not
existed in either parent chromosome. That is to say, the two clusters of SNPs found
in the parent chromosomes will have already been tested by evolution, but the newly
recombined cluster of primarily biased SNPs will very likely represent a new and
untested allele. While we may be able to assume that the majority of the individual
SNPs will not be overly harmful by themselves, it is not fair to assume that of the
newly minted collection. If the likelihood that a single mutation is harmful far
outweighs the likelihood that it is beneficial, then an even larger threshold of potential
harm should confront any new cluster of biased mutations. While many biologically
important sequences can tolerate minor deviations, it seems implausible to assume
that a novel collection of five or more changes will be equally benign. However, as
we can see, many examples of biased clusters have been fixed into the human
genome. It is likely that the vast majority of these clusters are of little or no
consequence, falling into the 95% of the genome which shows little sign of being
conserved and whose sequences may serve no critical function in the evolution of
humans. However, it should be expected that when BGC results in a novel cluster of
biased SNPs that fall in an important gene, the most likely result will be that negative
selection removes the cluster from the gene pool. At the same time, it is conceivable
that a few newly minted biased clusters might prove beneficial and might sweep the
genome at an even faster pace than BGC alone would account for.
An attempt to find the regions of greatest clustering of biased substitutions was
undertaken. This was done in a method agnostic to the conservation scores of the
regions. Thus, if roughly 5% of the genome is conserved, only about one out of the
top twenty regions should be expected to fall into a conserved stretch of DNA. Given
the further expectation that a biased cluster newly arising in a gene would most likely
be subject to negative selection, few genes were expected in the ranking of most
biased regions. Instead, the list was rich
with genes and some of the very top scoring regions fell in genes affecting the human
brain! In Table 2, the top 10 scoring regions of biased clustering can be seen. All
have a high density of substitutions which are overwhelmingly from weak to strong
base pairs. At least four and perhaps six of ten fall into genes. Four of these affect
brain development or function. Several of these regions deserve special treatment
here.
Rank  Location in hg18            Len   Subs  Weak to Strong  P Value    Significance of Region
1     chr2:113977236-113978604    1369  74    61              1.997E-06  pred:NT_022135.55[51]
2     chr9:2612396-2613708        1313  48    43              9.852E-06  VLDLR (Intron 1)[52]
3     chr20:61203595-61204231     637   37    34              2.543E-04  HAR1 (RNA gene)[3]
4     chr2:117529420-117530267    848   43    38              3.684E-04  pred:NT_022135.102[51]
5     chr2:115136243-115136977    735   39    35              6.744E-04  DPP10 (Intron 1)[53]
6     chr3:213300-214258          959   42    37              7.539E-04  CHL1 (UTR Exon 1)[54]
7     chr2:118333477-118334224    748   32    30              8.011E-04  HTR5B (Exon 1)[55]
8     chr8:1752398-1753394        997   38    34              0.001401   mRNA AF123758[41]
9     chr2:113630962-113631930    969   43    37              0.003132   EST BM926122[41]
10    chr13:112033076-112033732   657   26    25              0.004770   EST BG722997[41]
Table 2. The Top Regions of Biased Clustered Substitutions in Humans. In this list of the regions showing the most substitution bias, a surprising number of genes appear. The second region falls at the beginning of intron 1 of a very low density lipoprotein receptor that may interact with Reelin (implicated in autism, bipolar disorder and schizophrenia). Third is HAR1 discovered by Katie Pollard in this lab, which codes an RNA molecule implicated in fetal brain development. The sixth lies in the first 5’ UTR exon of CHL1, a member of the L1 gene family of neural cell adhesion molecules and is highly active in fetal brains. The seventh on this list falls within exon 1 of human HTR5B, a serotonin receptor which is a defunct gene in humans and chimps, but may very well have been active in our common ancestor and inactivated independently on each lineage (see text). The biased changes to all four of these regions affect human brain development or function. Perhaps less significant is number five which falls into the first intron of DPP10, mutations of which have been associated with asthma; and number eight which lies in the second intron of a putative transmembrane protein that may be implicated in a neurodegenerative disorder. The top region on this list, though not part of a known gene, tells a cautionary tale of its own. [Full list available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub18_d32.html]
3.4.1 Fastest Evolving Region of the Human Genome
The third most biased region found in humans, hg18.chr20:61203595-61204231, was
also discovered in a separate search for the fastest evolving, most conserved regions
of the human genome, by K. Pollard from this lab.[2, 3] In order to be considered as
part of that study, sequences had to be highly conserved between humans, chimps,
mice and rats, yet have multiple substitutions which occurred on the human line.
That search was agnostic to bias, yet resulted in a number of biased regions
occupying the top spots on the list. The very top scorer was dubbed HAR1 (human
accelerated region 1). Unknown as a gene at the time it was first identified, it has
subsequently been shown to code for an RNA which is expressed in fetal brains.
3.4.2 Serotonin Receptor Knocked Out in Humans and Chimps
The seventh most biased region falls into a pseudo-gene HTR5B, which is an active
gene in mice, but non-functional in both humans and chimps and may have been
destroyed by BGC. In mice, HTR5B translates into a functional 370 amino acid
protein which acts as a serotonin receptor.[55] It is characterized as a G-protein
coupled receptor with a single conserved domain ("7tm_1") containing 7
transmembrane helices. In addition, it has a single disulfide bridge and an N-linked
glycosylation site. The mouse DNA sequence (mm8.chr1:123337934-123355639 –
strand, February 2006 assembly) has 2 exons separated by an intron of approximately
17,000bp. This gene is likely to be non-functional in both humans[56]
(hg18.chr2:118333816-118377624 March 2006 assembly) and chimps
(pt2.chr2b:118470164-118515327 Jan. 2006 assembly), but for different reasons. In
chimps, a single deletion results in a frame shift in exon 1 at approximately 358bp
into the translatable sequence. In humans, a 35bp near tandem duplication and single
deletion in exon 1 at roughly 375bp into the translatable sequence results in a frame
shift. It is clear that these two frame shift events are not shared by humans and
chimps, so the question becomes whether the most recent common ancestor had a
functional HTR5B gene.
There is a homologous sequence in Rhesus macaque (rh2.chr13:124036266-
124076728 Jan. 2006 assembly), which is not addressed in the literature. It contains
no frame shifts and GeneWise[57] predicts a protein sequence given the mouse
HTR5B protein. The predicted protein sequence for macaque does include the
conserved domain 7tm_1 and all transmembrane helices appear to be intact.
However, the predicted disulfide bridge is broken, although the N-linked glycosylation
site remains.
A sequence was generated using simple parsimony based upon the human and chimp
sequences using macaque and mouse as out-groups. It covered the entire HTR5B
gene including intron and approximately 150bp upstream and 500bp downstream and
represents the most recent common ancestor of humans and chimps. Unlike the case
for human and chimp DNA sequences, GeneWise can find a homologous protein
sequence in the human-chimp ancestral DNA, using the mouse HTR5B as model.
Using CDD[58], the predicted human-chimp ancestral protein sequence scores slightly
better than the predicted macaque protein in finding the 7tm_1 conserved domain. In
addition, the predicted human-chimp ancestral protein retains the N-linked
glycosylation site. But unlike the predicted macaque protein, it retains the ability to
form a disulfide bridge between extra-cellular regions 1 and 3. Thus, the most recent
ancestor to humans and chimps may be considered at least as intact as any Rhesus
macaque HTR5B gene, from this analysis.
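The parsimony step can be sketched column by column over an alignment. The tie-breaking rules below are assumptions made for illustration, since the text states only that simple parsimony was used with macaque and mouse as out-groups: where human and chimp agree, their shared base is taken; where they differ, the nearest out-group that matches either one decides; otherwise the column is left unresolved as N. Function names and the toy sequences are hypothetical.

```python
BASES = set("ACGT")

def ancestral_base(human, chimp, outgroups):
    """Infer the human-chimp ancestral base at one aligned column.

    `outgroups` lists out-group bases nearest first (e.g. macaque, then mouse).
    Out-group symbols other than A/C/G/T are treated as uninformative.
    """
    if human == chimp:
        return human
    for out in outgroups:
        if out in BASES and out in (human, chimp):
            return out
    return "N"  # no out-group supports either state

def ancestral_sequence(human, chimp, *outgroups):
    """Apply the per-column rule across equal-length aligned sequences."""
    lengths = {len(human), len(chimp), *(len(o) for o in outgroups)}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return "".join(
        ancestral_base(h, c, outs)
        for h, c, *outs in zip(human, chimp, *outgroups)
    )

# Toy alignment (made up): columns 3 and 5 differ between human and chimp.
ancestor = ancestral_sequence("ACGTA", "ACCTG", "ACCTT", "ATCTG")
```

Comparing each descendant sequence against such a reconstruction is what allows a difference to be assigned to the human line, the chimp line, or the shared ancestral branch.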
There are 18 substitutions in the human-chimp ancestral sequence since divergence
from macaque, only 4 of which are weak to strong. Ten amino acid changes would
have resulted from these substitutions, 5 in the first extra-cellular region, one in extra-
cellular region 3 and 4 more in transmembranous regions. None of the
transmembranous changes appear to be disruptive. Only two changes seem
significant: serine to arginine and glutamic acid to glycine, both in extra-cellular
region 1. They have the effect of removing a negative charge and adding a positive
charge for a net gain of +2. It is not clear that this would disrupt function of a
putative human-chimp ancestral serotonin receptor.
Further evidence may come from a closer look at the chimp and human pseudo-genes.
There are 51 substitutions in the human sequence, but no way to determine when each
substitution occurred relative to the tandem repeat and deletion after which the human
HTR5B gene was definitely out of commission. The bulk of the substitutions (38)
occur in the first exon and 35 of these are weak to strong, which accounts for this
sequence being one of the top scorers in the list of high density biased regions (Table
2). The human gene lies in the region of chromosome 2 which was telomeric before
the fusion, and sustained a substantial amount of unexpected biased clustered
substitutions. The other 13 substitutions in human HTR5B are in exon 2, and only 7
of those are weak to strong. Clearly some process was acting on exon 1 which did
not affect the more distant exon 2. Of this biased group of substitutions in exon 1, 30
would have changed amino acids if this were still a functioning gene at the time of the
mutation. Many of these changes would have been sterically or chemically
significant including adding 7 (+) charges and 3 (-) charges. Especially hard hit is
transmembranous helix 5 which would have 3 (+) charges added. In stark contrast,
only 1 of the 13 substitutions for exon 2 would change an amino acid, and that change
is lysine to arginine (residues which are chemically and sterically similar). This alone suggests
that exon 2 was under selective pressure to remain intact during the time of these
substitutions. This would lend weight to the hypothesis that HTR5B was functioning
in the human line more recently than the human-chimp split, though it is no longer a
functional serotonin receptor.
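
The contrast between the two human exons can be made concrete with a small tally. The sketch below assumes the usual convention that A and T are "weak" bases and G and C are "strong" bases. The function and variable names are illustrative only and are not taken from the software used in this study; the counts are those quoted in the text above.

```python
WEAK = {"A", "T"}
STRONG = {"G", "C"}

def classify(ancestral, derived):
    """Label a single-base substitution by its effect on G+C content."""
    a, d = ancestral.upper(), derived.upper()
    if a in WEAK and d in STRONG:
        return "weak_to_strong"
    if a in STRONG and d in WEAK:
        return "strong_to_weak"
    return "neutral"  # A<->T or G<->C: G+C content is unchanged

# Counts quoted in the text for the human HTR5B pseudo-gene.
human_htr5b = {
    "exon 1": {"substitutions": 38, "weak_to_strong": 35, "amino_acid_changes": 30},
    "exon 2": {"substitutions": 13, "weak_to_strong": 7, "amino_acid_changes": 1},
}

def fractions(counts):
    """Return (weak to strong fraction, amino acid changing fraction)."""
    n = counts["substitutions"]
    return counts["weak_to_strong"] / n, counts["amino_acid_changes"] / n

for exon, counts in human_htr5b.items():
    ws, aa = fractions(counts)
    print(f"{exon}: {ws:.1%} weak to strong, {aa:.1%} amino acid changing")
```

Laid out this way, exon 1 is both far more biased and far more often amino acid changing than exon 2, which is the asymmetry the argument above relies upon.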
As for the fate of the chimp gene, only 12 substitutions occurred and all of them are
in the first exon, 8 being weak to strong. The process that led to many biased
mutations in the human line seems not to have acted as strongly here. These
substitutions would have changed 9 amino acids if this were a functioning gene.
None of the changes stand out as necessarily disruptive. However, an addition of 2
(+) charges to the extra-cellular end of the protein seems significant. The
substitutions and resulting protein sequence changes are summarized in Table 3. The
fact that the human-chimp ancestral protein already sustained a net gain of +2
suggests that either a higher positive charge was beneficial, or this protein was
already broken by this time. There is some evidence that electron donation from
the ligand is a part of binding in the serotonin receptor.[59] In addition, positive charges
might repel Na+ ions, keeping the receptor surrounds clear for serotonin molecules.
On the other hand, the receptor may be less receptive immediately following an
action potential.
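| | Human-Chimp Ancestor | Human | Chimp |
|---|---|---|---|
| Point Substitutions | 18 | 51 | 12 |
| Exon 1: Substitutions | 18 | 38 | 12 |
| Exon 1: Weak to Strong | 4 | 35 | 8 |
| Exon 1: Changed Amino Acids | 10 | 30 | 9 |
| Exon 1: Significance | +2 extra-cellular | many including TM5:+3 | +2 extra-cellular |
| Exon 2: Substitutions | 0 | 13 | 0 |
| Exon 2: Weak to Strong | | 7 | |
| Exon 2: Changed Amino Acids | | 1 | |
| Exon 2: Significance | | K→R little effect | |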
Table 3. Predicted Changes to a Possible HTR5B Protein Due to Point Substitutions Occurring in 3 Lines. It is unclear whether there was a functioning gene in the most recent common ancestor to humans and chimps. However, the lack of changes in exon 2, and especially the synonymous changes in human exon 2, suggest purifying selective pressure occurred after humans split from chimps. The human line shows strong evidence of BGC, which may have been responsible for destroying the functionality of this gene, even before the insertions and deletion created an early stop codon. The chimp gene may also have suffered BGC, though the evidence is less convincing. Again, it is unclear whether BGC may have destroyed the putative chimp gene before a deletion introduced a stop codon in a different place than is seen in the human pseudo-gene.
It is noteworthy that in the most recent common ancestor and in chimps, exon 2 was
not changed, while in humans the one change wasn’t significant, as summarized in
Table 3. It is also dramatic how many changes have occurred in the first exon since
our ancestor split from macaques and then again since humans and chimps split. It is
possible that there were destabilizing changes and selective pressure acting on this
area of a functioning protein. However, this is entirely speculative. HTR5B may
have been knocked out in the human-chimp ancestor by a change not examined here,
such as disruption of a promoter. One thing that seems certain is that the exon 1
region of the human (pseudo)gene experienced an unexpectedly large number of
biased clustered substitutions. Given its location in chromosome 2 and the large
number of debilitating biased substitutions it sustained, it is quite plausible that
functionality of the human HTR5B gene was destroyed by BGC.
3.4.3 Mistakes Were Made
Unfortunately, the top scoring region of biased clustering turns out to be a complex
story that is most likely a false positive. This region is one of a set of four sequences
within the human genome that align to each other and have also been identified by
FISH analysis.[60] This “syntenic block” involves sequences which range in size from
160,000bp to 200,000bp and achieve high similarity scores representing more than
90% identity. Only one of these four regions shows evidence of significant
biased change, and only within a much narrower range of about 1400 bases. Moreover,
while all four regions are quite clearly identified in the March 2006 assembly of the
human genome, only three are found in the January 2006 assembly of the chimp
genome. The result is an alignment error between humans and chimps, and further
error when aligning with the macaque assembly. The large number of weak to strong
changes clustered together in the top scoring region may be due to the same
telomeric process characterized in this study. The region is substantially altered in a
biased way, when compared to at least two of the other regions in the human genome
from which it was originally duplicated. However, the biased changes may have
accumulated over a much longer time than the six million years since the human and
chimp lines split. An alignment error may result in a false positive, not necessarily in
revealing a biased process but in locating that process temporally. The pitfalls of
misalignment, which should not be widespread when comparing two genomes with
greater than 98% identity[1], do suggest that any list of the most biased regions should
be filtered for self-aligning sequences or repetitive regions. To that end, a final list of
top scorers was generated by filtering out regions with higher probability of
alignment errors (available at
http://www.cse.ucsc.edu/research/compbio/ubcs/sub18_d32nr.html ).
3.4.4 Biased Clusters Are Transcribed
It has already been shown that there are a surprising number of genes among the top
scoring biased cluster regions. If we look for evidence of transcription, the
relationship is hard to ignore. In the top 200 regions of clustered bias, 108 or 54%
occur in known genes.[38] If the net is widened to include evidence of transcription in
humans in the form of mRNAs and ESTs[39, 40], 80.5% of the top 200 regions are
caught. By comparison, random regions fell into known genes 32.5% of the time and
showed evidence of transcription 57.5% of the time as seen in Table 4. While
“known genes” includes introns, the biased regions are also disproportionately found
in exons. While only 3% of randomized regions overlap exons, 15.5% of the top 200
biased regions do. Further, while less than 10% of randomized regions that are in
known genes actually overlap exons, more than 28% of the most biased regions do.
While this remarkable relationship is evident in the top regions, there is no correlation
seen between the density of transcribed bases and UBCS (Figure 22). Thus, it would
appear that these top regions have been selected for their influence on transcribed
regions. But there is no corresponding evidence of this selection seen in the residual
signal after male recombination has been accounted for (Figure 24). There are two
possible explanations. Either the number of selected regions is too small to influence the
genome wide relationship seen in Figure 24, or the effect of transcribed bases is
already accounted for in male recombination.
| Transcription Evidence | Top 200 Regions of WtoS from Filtered List | Top 200 Regions of WtoS from Unfiltered List | Bottom 200 Regions StoW Unfiltered | 200 Regions with Randomized Locations |
|---|---|---|---|---|
| Known Genes[38] | 123 (61.5%) | 108 (54%) | 69 (34.5%) | 65 (32.5%) |
| Exon | 36 (18%) | 31 (15.5%) | 12 (6%) | 6 (3%) |
| Coding Exon | 18 (9%) | 18 (9%) | 7 (3.5%) | 5 (2.5%) |
| Human mRNA[41] | 24 | 23 | 30 | 23 |
| Human EST[41] | 21 | 30 | 32 | 27 |
| non-human mRNA | 6 | 9 | 14 | 10 |
| non-human EST | 25 | 29 | 54 | 58 |
| No Transcription Evidence | 1 | 1 | 1 | 17 |
| Transcription Evidence in Humans | 168 (84%) | 161 (80.5%) | 131 (65.5%) | 115 (57.5%) |
Table 4. Evidence of Transcription Among Biased Regions. When the top 200 most biased regions of substitutions in the human genome are examined, it is clear that they are unusually likely to be found in known genes or transcribed in humans. A surprising 54% of the top regions are found in known genes[38], while filtering out regions found with repeats or self-aligning sequences raises the known gene count to 61.5%. By contrast, only 34.5% of the bottom of the list of regions and 32.5% of random regions are found in known genes. The bottom regions are biased strong to weak, while random regions were generated by taking the top 200 clusters and randomizing their locations within the genome.
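
The percentages quoted in the text can be recomputed directly from the counts in Table 4. The sketch below is only a bookkeeping aid: the dictionary keys and function name are illustrative, and it assumes that the categories from Known Genes down to No Transcription Evidence (excluding the Exon and Coding Exon rows, which are subsets of Known Genes) are mutually exclusive, which is consistent with each column summing to 200.

```python
TOTAL_REGIONS = 200

# Counts from Table 4: (known genes, exon, human mRNA, human EST,
#                       non-human mRNA, non-human EST, no evidence)
TABLE4 = {
    "top WtoS, filtered":   (123, 36, 24, 21, 6, 25, 1),
    "top WtoS, unfiltered": (108, 31, 23, 30, 9, 29, 1),
    "bottom StoW":          (69, 12, 30, 32, 14, 54, 1),
    "randomized":           (65, 6, 23, 27, 10, 58, 17),
}

def summarize(row):
    """Recompute the summary percentages for one set of 200 regions."""
    known, exon, h_mrna, h_est, nh_mrna, nh_est, none = row
    # The exon count is a subset of known genes, so it is left out of the sum.
    assert known + h_mrna + h_est + nh_mrna + nh_est + none == TOTAL_REGIONS
    return {
        "known_gene_pct": 100 * known / TOTAL_REGIONS,
        "exon_pct": 100 * exon / TOTAL_REGIONS,
        "exon_pct_of_known_genes": 100 * exon / known,
        "human_transcription_pct": 100 * (known + h_mrna + h_est) / TOTAL_REGIONS,
    }

for name, row in TABLE4.items():
    s = summarize(row)
    print(f"{name}: " + ", ".join(f"{k}={v:.1f}" for k, v in s.items()))
```

The `exon_pct_of_known_genes` value is the figure behind the statement that fewer than 10% of randomized regions in known genes overlap exons, while more than 28% of the most biased regions do.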
3.4.5 Currently Bias May be Leading to Thrill Seeking and Disease
In Figure 6, the evidence for biased clusters among single nucleotide polymorphisms
was examined. While the evidence for biased clusters of substitutions is dramatic,
there seems to be little corresponding pattern for SNPs. Nevertheless, the genome
wide heat map (Figure 6, lower right) does suggest the existence of some biased SNP
clusters. Therefore, a list of the most cluster biased regions of SNPs was generated
for hg17 as seen in Table 5. Three of the top 5 regions are found in genes expressed
in the brain, and two of these are implicated in cancers. The top scoring region falls
in the intron of a gene required for chronic pain perception.[61] While there is no
evidence that the biased changes in these three regions are currently being selected
based upon relative fitness, these are just the sorts of complications that might be
expected when novel alleles occur in genes. It is also possible that the positive
“selection” caused by BGC at a pervasive recombination hot spot might compete
directly with negative fitness selection on a dangerous new allele. The result could be
a slowing of the purifying fitness selection that should ultimately exclude the gene
from the genome. It is even possible that a weakly negative biased cluster allele will
overcome the fitness hurdle and be fixed in the genome. This possibility illustrates
the innate danger of BGC based selection. In any case, it is clear that biased clusters
of SNPs are occurring in important sequences in the human genome today.
| | Location in hg17 | Len | SNPs | Weak to Strong | P Value | Significance of Region |
|---|---|---|---|---|---|---|
| 1 | chr11:84993092-84993239 | 148 | 8 | 8 | 5.107E-04 | CR749820 (Intron 2)[62] |
| 2 | chr6:122814786-122814895 | 110 | 6 | 6 | 0.003397 | TDE2 (Exon 5)[63] |
| 3 | chr3:79931893-79932009 | 117 | 6 | 6 | 0.003397 | pr:NT_022459.266[51] |
| 4 | chr8:3474808-3474931 | 124 | 6 | 6 | 0.003397 | pr:CSMD1(Intron 7)[64] |
| 5 | chr9:29893639-29893780 | 142 | 6 | 6 | 0.003397 | |
Table 5. Most Biased Clusters of SNPs found in the May 2004 release of the Human Genome. The top two regions are both expressed in the human brain, while the homolog of the fourth region is expressed in the mouse brain. More ominously, region 2 is “Tumor Differentially Expressed Protein 2” associated with lung and liver cancers[65], while region 4 may be a cause of oral and oropharyngeal squamous cell carcinomas.[66] [Full list available at: http://www.cse.ucsc.edu/research/compbio/ubcs/snp17_d32.html]
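
The entries of Table 5 can be checked mechanically. The sketch below parses each hg17 location, confirms that the listed length matches the coordinates when both ends are counted, and computes the per-SNP probability q that would give the listed P value if that value were simply q raised to the number of SNPs. That last relationship is an assumption made here for illustration: how the P values were actually calculated is not restated in this section. All function names are illustrative.

```python
import re

# Rows of Table 5: (hg17 location, length, SNPs, weak to strong, P value)
TABLE5 = [
    ("chr11:84993092-84993239", 148, 8, 8, 5.107e-04),
    ("chr6:122814786-122814895", 110, 6, 6, 0.003397),
    ("chr3:79931893-79932009", 117, 6, 6, 0.003397),
    ("chr8:3474808-3474931", 124, 6, 6, 0.003397),
    ("chr9:29893639-29893780", 142, 6, 6, 0.003397),
]

def parse_region(location):
    """Split 'chrN:start-end' into (chromosome, start, end)."""
    match = re.fullmatch(r"(chr\w+):(\d+)-(\d+)", location)
    if match is None:
        raise ValueError(f"unrecognized location: {location!r}")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)

def region_length(location):
    """Length in bases, treating both coordinates as inclusive."""
    _, start, end = parse_region(location)
    return end - start + 1

def implied_per_snp_probability(p_value, n_snps):
    """ASSUMPTION: P = q ** n when all n SNPs are weak to strong; solve for q."""
    return p_value ** (1.0 / n_snps)

for location, length, snps, w_to_s, p_value in TABLE5:
    assert region_length(location) == length
    q = implied_per_snp_probability(p_value, snps)
    print(f"{location}: {w_to_s}/{snps} weak to strong, implied q = {q:.4f}")
```

Under this assumption the 8-SNP and 6-SNP rows imply nearly the same q, a little under 0.39, so the listed P values are at least consistent with a single per-SNP probability of a weak to strong change.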
4.0 Discussion
Upon realization that the fastest evolving regions of the human genome have a
surprising bias of weak to strong substitutions, this effort was undertaken to
characterize the dimensions of that bias genome wide. Using a windowing approach,
a clear association of bias with clustering of substitutions was established, and is far
and away the most obvious hallmark of biased substitution seen in this study. In
sharp contrast, G+C content of the surrounding window, conservation score and
location in or near a hot spot are all poor predictors of biased substitution. The
dimensions of biased clustering were documented with “zipper plots” and it was
found genome wide and on all individual human autosomes. By using one definition
of biased clusters, a measure of bias which is unexpected from a purely stochastic
process was mapped genome wide. These “Unexpected Biased Clustered
Substitution” (UBCS) maps illustrate a clear tendency of bias to be strongly elevated
near the telomeres of autosomes, with much smaller, non-telomeric peaks also
evident.
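
As an illustration only, the idea of measuring bias within clusters of nearby substitutions can be sketched as follows. The gap threshold, the minimum cluster size and the function names are hypothetical placeholders chosen for this example; the definitions actually used in this study are those given in the Methods.

```python
def find_clusters(positions, max_gap):
    """Group positions into runs whose neighbors lie within max_gap bases."""
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return clusters

def cluster_bias(substitutions, max_gap=300, min_size=3):
    """Return (start, end, count, weak to strong fraction) for each cluster.

    substitutions maps a position to True when the change is weak to strong.
    max_gap and min_size are hypothetical defaults, not the study's values.
    """
    results = []
    for cluster in find_clusters(substitutions, max_gap):
        if len(cluster) < min_size:
            continue  # isolated substitutions are not treated as a cluster
        w_to_s = sum(substitutions[pos] for pos in cluster)
        results.append((cluster[0], cluster[-1], len(cluster), w_to_s / len(cluster)))
    return results

# Toy data: one tight, fully biased cluster and two isolated substitutions.
toy = {100: True, 150: True, 210: True, 260: True, 5000: False, 9000: True}
print(cluster_bias(toy))
```

Scoring clusters rather than fixed windows keeps isolated substitutions from diluting the signal, which matches the observation above that clustering is the clearest hallmark of bias.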
4.1 Unexplained Selection
The phenomenon of biased clusters is not evident in Single Nucleotide
Polymorphisms, which strongly argues that selection is acting upon unbiased SNPs to
create biased clusters of substitutions. Despite a less reliable dataset, biased clustered
substitutions are detectable in the UBCS maps of the chimpanzee genome, which
shows that the selection pressure is not limited to humans. However, given that
neither conservation scores nor transcription density is correlated with elevated weak
to strong bias, the selection pressure is not due to the selection of genes or other
regions that are highly conserved. At the same time, local G+C content, and the
larger isochores are only moderately correlated to bias and do not provide an
explanation for the biased clusters found in the genome. This lack of causation
appears to apply in the other direction as well. The selection of biased substitutions is
not currently growing or even maintaining G+C rich isochores. However, the highly
geographical profile of the UBCS maps suggests an explanation for the existence of
isochores. G+C rich isochores may have been created by the accumulation of weak
to strong substitutions in regions near telomeres over hundreds of millions of years.
Current non-telomeric isochores may have been moved from telomeric regions by
chromosomal rearrangements, and this would explain the finding that they are no
longer growing or being maintained.[67] This theory would predict that telomeric
isochores are still strengthening.
The model that most closely matches the patterns of weak to strong bias seen in the
human genome is Biased Gene Conversion. Under that model, recombination events
will lead to heteroduplexed DNA and the pairing of short stretches of ssDNA from
sister chromosomes, with mismatched bases being “repaired” in a biased manner.
The model directly predicts selection of clusters of biased substitutions which is what
is found here. This “selection pressure” is dependent upon recombination events and
recombination hotspots being widespread among members of a population. However,
there is little correlation between the bias found and the distance to current recombination
hotspots. This result is consistent with the model that hotspots do not travel but are
eliminated by recombination events.[24] Nevertheless, biased clustered substitutions
accumulated over the last 6 million years are strongly correlated with recombination
rates measured today. If BGC is not the cause of the selection of biased clusters, any
alternative explanation would have to account for this correlation.
4.2 Just like Us
The fact that the UBCS curves for human and chimp cousin chromosomes are
remarkably similar, despite 12 million years of evolutionary distance, is startling.
BGC fixation pressure requires a stable recombination hotspot resident in a
significant number of individuals in a population, in order to provide enough pressure
to push a biased cluster over the fixation hurdle. But individual BGC events may not
act upon the same complement of SNPs, so the widespread penetration of a given
hotspot within a population may not be enough to fix a biased cluster. Clearly, the
size of the population should affect the probability that a given biased cluster will be
fixed. What this suggests is that a BGC selection pressure cannot be expected to be
constant over time as population size varies. Instead, the clustered bias signal found
in humans and chimps may have been fixed in the two genomes long ago, during
population bottlenecks.
However, the fact that our two species, with different population histories, have
similar clustered bias signals across most of the aligned chromosomes seems to
conflict with this view. If the UBCS signal is created in recombination events at hot
spots, and if the locations of hot spots change over time, the two sets of curves should
not be similar. Either the vast majority of the signal was created very soon after the
two lines split, or the two lines must have had very similar population bottlenecks and
hot spot locations through the last six million years. A more likely explanation is that
though hot spot locations are not constant, hot regions in the million base range are.
This interpretation is supported by evidence that regions of high recombination are
persistent even if hot spots are not.[22] There is another implication of the human and
chimp UBCS curve similarity, especially as seen in the similar heights of the telomere
peaks. While population bottlenecks may affect the accumulation of biased clusters,
the effect does not appear to have been large enough to tip the outcome dramatically
in humans and chimps. While it would be irrational to abandon the model of a variable rate
of UBCS accumulation over six million years, the parallel signals seen in the human-chimp
cousin chromosomes lead to the useful assumption that the rate of UBCS accumulation has
been “relatively” constant over the last 6 million years of human and chimp
evolution.
4.3 Two Chromosomes Come Together on a Date
The distinct winged pattern of UBCS is universal across all autosomes for which
there is data for both humans and chimpanzees. However, human chromosome 2
stands out with its internal peak, which so clearly delineates the fusion point of
ancestral chromosomes 2a and 2b. The similarity of the UBCS signals between
humans and chimps leads to the suggestion that the fusion of ancestral chromosomes
2a and 2b occurred relatively recently. Using the assumption of a relatively constant
rate of UBCS accumulation during human evolution, a date of fusion may be inferred.
To be clear, the exercise of dating the fusion relies upon a second big assumption.
The strong telomeric signal suggests that location is an overwhelming determinant of
UBCS. While there are non-telomeric peaks to the unexpected biased clustered
substitution signal, and while these peaks can be parallel in human and chimp
autosomes, they are dwarfed by the peaks at the telomeres. From this, it can be
deduced that, though biased clusters may continue to be fixed in the middle of human
chromosome 2, the rate is not likely to be as large as in the telomeric regions. Indeed,
male recombination rate, which most strongly correlates to the UBCS signal across
the genome, and is elevated at the telomeres, is not elevated in the central region of
chromosome 2. Exactly how long it took for the male recombination rate to
fall off once fusion occurred is hard to predict. But the theory that the drop in
male recombination rate occurred due to the fusion is at least as good as alternatives.
Using the assumptions of a relatively constant rate of UBCS accumulation before
fusion and an immediate drop in UBCS accumulation after fusion, it is possible to
estimate the date of the fusion itself. By comparing the UBCS heights of the
telomeres on chimp chromosomes 2a and 2b, it is possible to predict the heights that
the missing human 2a and 2b telomeres would have reached if the fusion had never
taken place. Additionally, by comparing the UBCS heights of the other human and
chimp cousin telomeres, it is possible to generate a confidence interval. It is useful to
define a telomere region as 17mbp, which is just large enough to account for the
above average UBCS signal found at telomeres. Using the relative telomere heights,
17mbp of smoothed telomere UBCS, and 6 MYA as the point of speciation, the
calculated fusion date is 0.93 million years ago, but with a 95% confidence interval of
anytime in the last 2.86 million years. While this prediction is dependent upon a
17mbp telomeric region, the prediction never exceeds 1.45mya (CI95: 3.24my) as the
telomeric region size grows to more than 60mbp. The dating of the fusion is not
very precise to begin with, and it is further blurred by imprecision in dating the age of
the human lineage, which has been calculated with a confidence interval of
4.98–7.02 MYA.[68] However, though the predicted fusion date is imprecise, the above
assumptions lead to the conclusion that the fusion almost certainly occurred in the
more recent half of hominid evolution and very possibly since the arrival of the Homo
genus. Clearly, the fusion that created chromosome 2 was not the speciation event
between the human and chimp lines[48], but may have been involved in a more recent
speciation event on the way to making us human.
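
The dating argument amounts to a simple proportion, which can be sketched under the two assumptions stated above: a relatively constant rate of UBCS accumulation before the fusion, and an immediate stop afterwards. On that reading, the fraction of the expected telomeric signal that is actually observed at the fusion point estimates the fraction of the time since speciation that elapsed before the fusion. The heights below are made-up numbers chosen only so that the result matches the 0.93 million year figure; the function name is illustrative, and the sketch omits the smoothing and confidence interval of the actual calculation.

```python
def fusion_date_mya(observed_height, expected_height, split_mya=6.0):
    """Estimate how long ago the fusion occurred, in millions of years.

    observed_height: UBCS signal at the fusion point (hypothetical units).
    expected_height: signal the lost telomeres would have reached with no
                     fusion, predicted from the chimp 2a and 2b telomeres.
    split_mya: assumed date of the human-chimp split.
    """
    if expected_height <= 0:
        raise ValueError("expected_height must be positive")
    fraction_before_fusion = observed_height / expected_height
    return split_mya * (1.0 - fraction_before_fusion)

# Hypothetical heights, chosen only to illustrate the arithmetic.
hypothetical_expected = 100.0
hypothetical_observed = 84.5

for split in (4.98, 6.0, 7.02):  # range of speciation dates quoted in the text
    date = fusion_date_mya(hypothetical_observed, hypothetical_expected, split)
    print(f"split {split} MYA -> fusion about {date:.2f} MYA")
```

Because the estimate scales linearly with the assumed speciation date, any uncertainty in that date carries over proportionally to the fusion date.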
4.4 The X-Exception: Are Men Really to Blame?
While the autosomes follow a consistent path to cluster bias, the sex chromosomes
stand apart. The clear lack of signal on chromosome Y might seem surprising since
this chromosome has experienced a larger rate of substitution than other human
chromosomes.[73, 74] However, the BGC model, which relies upon recombination,
predicts that Y would show no biased clustering beyond stochastic expectations, and
that is what is seen. Since recombination only happens in the pseudo-autosomal
regions between X and Y, BGC predicts an elevated signal in this region, and indeed,
this is the region of Y with the highest elevation in UBCS, though it is not statistically
significant.
The greatly reduced clustered bias on human chromosome X is an enigma, however.
Under almost all measures, X stands out from the autosomes as an exception when it
comes to biased substitutions. The bias signal is so reduced in human X, it is hard to
be convinced that it isn’t just statistical noise. If recombination events are the driving
force fixing biased clusters, then the signal should be reduced by half, since the vast
majority of the X chromosome is only subject to recombination in mothers. It can be
expected to be further reduced, since it is well documented that the X chromosome
experiences a lower mutation rate than the genome as a whole.[69, 70] Indeed, in the
May 2004 assembly, the non-pseudo-autosomal regions (non-PAR) of the X
chromosome have only 52.3% as many substitutions (per mega-base) as do
autosomes. Together, reduced substitutions and reduced recombination should lead
directly to reduced UBCS for X. Given that two of these three values are known (the
relative UBCS and the relative substitution rate), it is possible to calculate the third:
a historical recombination rate for non-PAR X of only 30.1% of
the rate for autosomes. However, recombination rates are known to be as much as
1.65 times higher in females than males[37], which would translate to non-PAR X
experiencing 62% as many recombinations as the autosomes. And directly measuring
the current recombination rates in the deCODE[37] dataset shows X recombining
50.2% as often as autosomes. Clearly, the reduced UBCS signal seen on the human
X chromosome is not explained by a simple combination of reduced mutation and
reduced recombination on X.
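
The arithmetic behind these comparisons can be laid out explicitly. The sketch below makes two assumptions that are interpretations rather than statements from the text: that non-PAR X recombines at half the female rate while autosomes recombine at the average of the male and female rates, and that relative UBCS is simply the product of relative substitution rate and relative recombination rate. The function names are illustrative, and the relative UBCS value it prints is only implied by the figures quoted above.

```python
FEMALE_TO_MALE = 1.65          # female recombination relative to male[37]
X_SUBSTITUTION_RATIO = 0.523   # non-PAR X substitutions relative to autosomes
X_HISTORICAL_RECOMB = 0.301    # recombination ratio calculated from UBCS
X_MEASURED_RECOMB = 0.502      # recombination ratio measured in deCODE[37]

def x_recombination_ratio(female_to_male):
    """Relative recombination of non-PAR X versus autosomes.

    ASSUMPTION: X recombines only in females (half its transmissions),
    while autosomes recombine at the male/female average.
    """
    male = 1.0
    female = female_to_male * male
    autosome = (male + female) / 2.0
    x_non_par = female / 2.0
    return x_non_par / autosome

def expected_ubcs_ratio(substitution_ratio, recombination_ratio):
    """ASSUMPTION: relative UBCS = relative substitutions x relative recombination."""
    return substitution_ratio * recombination_ratio

predicted_recomb = x_recombination_ratio(FEMALE_TO_MALE)
print(f"recombination ratio from sex difference: {predicted_recomb:.1%}")
for label, recomb in [("historical (from UBCS)", X_HISTORICAL_RECOMB),
                      ("from sex difference", predicted_recomb),
                      ("measured", X_MEASURED_RECOMB)]:
    ubcs = expected_ubcs_ratio(X_SUBSTITUTION_RATIO, recomb)
    print(f"{label}: relative UBCS on X = {ubcs:.3f}")
```

With equal male and female rates the first function returns one half, matching the expectation that the signal should be reduced by half. With the 1.65 ratio it returns about 62%, so either estimate of current recombination predicts a noticeably larger UBCS signal on X than the one implied by the 30.1% figure.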
Unfortunately, substitutions in the human genome alone may not provide a
convincing answer to the question of whether X is experiencing the selective force
that leads to biased clusters of substitutions. One place to seek further evidence is in
other species. While the quality of data is reduced for chimpanzees, there is still clear
UBCS found on all the chimp autosomes, for which there is data. However, there is
no statistically significant UBCS found on chimp chromosome X.
The most satisfying explanation of the reduced signal in X is the strong correlation
between UBCS and the male recombination rate. Indeed, it is the strongest
correlation of UBCS with any other factor examined in this research. If only male
recombination leads to BGC and selection of biased clusters of substitutions, then
there should be no signal on the non-PAR regions of chromosome X. This should be
the nail in the coffin. However, this strong correlation is dependent upon near-
telomere regions. The advantage of male recombination rates diminishes as the
telomere regions are trimmed from the dataset, and is completely wiped out in
smoothed UBCS genome wide, when 18mbp of telomere are removed from each
chromosome. Nevertheless, female recombination is not correlated with the
remaining UBCS signal left unexplained by male recombination (Figure 24).
Therefore, it isn’t easy to explain all the evidence without resorting to male
recombination as the cause of UBCS. Otherwise, it becomes necessary to postulate some force
other than recombination that is driving the fixation of biased clusters of
substitutions. That force would have to be strongly telomeric in autosomes; non-
existent in Y; and greatly reduced, or possibly absent in the X chromosome. While
the small signal of UBCS found in human X is an enigma, a preponderance of the
evidence supports recombination in male gamete production as the driving force
causing biased clusters of substitutions to be fixed in the genome. Any remaining
question of whether BGC occurs in male, but not female recombination, will no doubt
be resolved soon.
Nevertheless, the enigmatic X chromosome shows some symptoms. In Figure 15, X
shows some elevation of unexpected bias at the left (short ‘p’ arm) telomere.
Additionally, the heat map in Figure 17 suggests that there may be some weak signal
in some extreme clusters of X, especially in comparison to chromosome Y. An
explanation may come in the Pseudo-Autosomal Regions (PAR) between X and Y.
While X-X recombinations do not occur in males, X-Y recombinations can occur in
the two PAR regions at both telomeres. These regions show alignment between X
and Y. However, the elevated wing of chromosome X seen in Figure 15 extends
beyond the PAR region of arm p (bias signal ~11mbp; PAR ~2.6mbp). One possible
explanation might be the degradation of the PAR region since the human-chimp split.
The PAR is believed to have extended to Xp22[71] which would be equivalent to a
25mbp PAR on the p arm of the current human X. The loss of genetic material from
Y is well established and the chromosome may dwindle away within 10 million
years.[72] While no unique profile of Y genes or pseudo-genes have been lost in either
humans or chimps, there is evidence of degradation and extensive reorganization[73, 74]
which may have also disrupted PAR alignment and recombination between X and Y
since the human-chimp split. Therefore, it is conceivable that the elevated bias in the
p arm of the X chromosome built up early in hominid evolution, but is no longer
building through much of the ~11mbp length. However, any bias on the p arm of X
is very weak when compared to that on autosomes and is barely elevated above the
majority of the X chromosome, which theoretically has not been subjected to the
influences of male recombination. There is no corresponding evidence of elevated
bias in the PAR region of chimp chromosome X (Figure 20). This could be due to
BGC not occurring in the PAR region or to an early rearrangement of chromosome Y
in chimp evolution. While the question of whether male recombination is the cause
of accumulated bias may be resolved soon, the prospect of using the ambiguous
UBCS signal seen in the X chromosome to chart the evolution of the PAR is not so
promising.
4.5 The Gamble of Male Meiosis
The picture of biased clustered substitutions is coming into focus. The model is male
recombination events leading to biased gene conversion which creates a pressure
towards fixing clusters of biased substitutions into the genome. However, the reason
why male but not female recombination might result in BGC is not clear. A
component of the recombination repair mechanism may be Y linked, or BGC could
be the result of an X linked recessive mutation, or protection from BGC could require
a haplo-insufficient X linked enzyme. Perhaps the expense of rapid male gamete
production precludes the luxury of careful replication. For BGC is not careful
replication, and that suggests the biggest mystery of the ubiquitous bias found in this
study. In a system that results in as few replication errors as one base pair in a
billion[75], how can such sloppiness be tolerated? The creation of novel biased
clusters should be more harmful than beneficial on average. If evolution has created
the mechanism to avoid BGC in females, why isn’t it used in males?
It is well known that mutation rates are higher in males[70, 76, 77], even while
recombination rates are higher in females.[37] That oogenesis should be more
conservative while spermatogenesis less careful appears consistent with the
predictions of sexual selection and parental investment theory.[78] The probability that
an individual ovum will be successful is far larger than the probability for a male gamete.
But the reason for the higher rate of mutations in males, sometimes referred to as “male
driven evolution”[76], is most frequently said to be due to the larger number of cell divisions
occurring in the germ cell lines of males. However, other evidence suggests that the
difference in germ cell generation alone does not explain the male bias in mutation
rates[79, 80, 81], and part of this discrepancy may be due to sloppy recombination. It has
even been argued that sexual selection encourages higher male mutation rates which,
in turn, increase the consequences of sexual selection.[82] But something doesn’t add
up when explaining error prone recombination by male parental investment. Even if
the vast majority of male gametes do not result in offspring, the lucky gamete that
does may still be harmed by a single mutation. It would seem that accurate
replication should be as important in males as in females!
However, while male replication may result in some larger percentage of mutant
gametes, those gametes that arrive at the egg may not be so disadvantaged.[83] It is
possible that sperm competition or intra-mating pair “sperm selection” is the crucible
that tests the newly replicated genomes, and weeds out the more disastrous biased
clusters. Thus, the expense of fidelity in replication is not paid, but the cost is fewer
viable sperm. This trade off still does not make obvious sense. However, if energy is
wasted creating some larger percentage of defective male gametes, it may also be the
case that some minute but still increased percentage of gametes may hold beneficial
mutations. Thus, the sloppiness of BGC may be tolerated in males because gamete
competition and selection allows gambling with the genome. Perhaps the majority of
biased clusters created in male replication will be harmless, while occasionally, a
biased cluster will be harmful and the gamete will not survive. However, in a tiny
fraction of the rolls of the dice, a biased cluster will create some more effective
enzyme which will result in a more efficient sperm, which will result in a big payoff.
Under this model, sexual reproduction runs an evolutionary experiment during every
fertilization! The model of male based biased gene conversion as a gamble predicts
that there should be more UBCS evidence in species where males create vast numbers
of gametes which are tested by sperm selection or competition. Biased clusters of
substitutions and high male substitution rates might be especially prevalent in species
where a few males produce a majority of offspring, since the winnings of the
reproductive gamble might be even larger, while the losses less significant. Evidence
supporting this can be found in the mutation rates of different species of birds.[82]
Higher mutation rates and even biased clusters might also be found as a female
reproductive strategy in species which produce vast numbers of eggs. On the other
hand, species with a long history of monogamous mating and with few offspring
might be expected to show less evidence of biased clusters of substitutions.
There is a real problem lurking in this theory of why BGC might be tolerated in
males. Unless novel biased clusters are tested before fertilization, there is still an
inherent danger in sloppy replication. If a new biased cluster destroys a gene that is
not even used until fetal brain development, the gamble of male recombination would
result in a very costly loss. However, BGC creates novel biased clusters at
recombination points. There is some evidence that recombination occurs where
transcription is in progress.[84, 85, 86] If this were so, then biased clusters of
substitutions should predominantly arise within transcribed regions of the human
genome, which is seen in the top 200 most biased regions examined in this study.
Supporting this model is the finding that genes appear to lead the accumulation of
G+C in G+C rich isochores.[11] However, such a model would predict that biased
clusters of substitutions should be found predominantly in areas transcribed in male
gamete production. While this has not been shown, there is evidence that widely
expressed housekeeping genes are more often found in G+C rich isochores[87] than are
tissue specific genes. Taken together, these bits of evidence provide a consistent set
of hints as to where to look for the process that results in biased clusters of
substitutions and how that process is tolerated.
4.6 A Thumb on the Scales of Evolution
Perhaps the most intriguing implication of the force causing biased clusters of
mutations to be fixed in the genome is that it acts like non-Darwinian selection.
Darwin’s theory predicts that relative fitness among phenotypes leads to the selection
that drives evolution, and there is abundant evidence that fitness selection is the
overwhelmingly dominant force in functional evolution. Additionally, Darwin
understood that mate preference can also lead to evolution, in the process known as
sexual selection. However, biased clusters of mutations are being “selected”[27, 12]
and fixed in the genome in a widespread pattern that is not easily explained by either
the relative fitness or sexual selection of the individual clusters. Darwinian selection
may tolerate sloppy replication; and relative fitness and sexual selection may
ultimately explain the male reproductive gamble. But most biased cluster alleles are
probably not individually selected by fitness or preference.
The theory of neutral evolution[88] is usually stated as evolution without selective
pressure. It is fueled by the many random mutations and the stochastic process that
fixes some of them into the genome. Neutral evolution can touch phenotype, but as
soon as the change is non-neutral, Darwinian selection supersedes it. But the pressure
of biased gene conversion behaves just like neutral evolution by selection. Many
clusters are being “selected” even though they have no phenotypic effect.
Occasionally biased clusters will have a phenotypic effect; and while many of those
changes will be neutral, clearly some will not be. Therefore, the BGC “selection”
pressure may speed positive selection and compete with negative selection. It is even
possible that a mildly negative phenotype will be fixed into the genome by BGC,
more quickly than purifying selection can remove it.
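The competition described above between BGC pressure and purifying selection can be illustrated numerically. The following is only a minimal sketch, not an analysis performed in this thesis. It assumes that BGC can be treated as if it added a small coefficient b to the selection coefficient s of an allele (in the spirit of the gene conversion model of [27]), and it uses Kimura's standard diffusion approximation for the fixation probability of a new mutation. The function name, the additive combination s + b, and all parameter values are hypothetical choices made for illustration.

```python
import math

def fixation_probability(n_e, s, b=0.0):
    """Approximate fixation probability of a new allele (start frequency 1/(2N)).

    n_e : effective population size (diploid individuals)
    s   : selection coefficient (negative = mildly harmful)
    b   : ASSUMED extra coefficient standing in for BGC pressure
    """
    s_eff = s + b  # assumption: BGC simply adds to selection
    if abs(s_eff) < 1e-12:
        return 1.0 / (2.0 * n_e)  # neutral limit
    exponent = -4.0 * n_e * s_eff
    if exponent > 700.0:
        return 0.0  # strongly harmful: avoid overflow, fixation is negligible
    # Equivalent to (1 - e^(-2 s_eff)) / (1 - e^(-4 N s_eff))
    return math.expm1(-2.0 * s_eff) / math.expm1(exponent)

# Hypothetical values, chosen only to show the direction of the effect.
N_E = 10_000
neutral = fixation_probability(N_E, 0.0)
harmful_alone = fixation_probability(N_E, -1e-4)
harmful_with_bgc = fixation_probability(N_E, -1e-4, b=2e-4)
print(neutral, harmful_alone, harmful_with_bgc)
```

Under these assumed values, the mildly harmful allele fixes less often than a neutral one when selection acts alone, but more often than a neutral one once the assumed BGC term outweighs the selection against it.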
The implication is that not all phenotypic changes are meaningful, even when they are
non-neutral. Perhaps the protruding nose of humans is neither functional nor more
attractive in mate selection. Indeed, it is quite possible that many of the traits that
make us most human were made possible through “neutral evolution by selection”.
For instance, a greater vocal range may have appeared by accident and been fixed
into the evolving species for no other reason than that it was the result of a biased
cluster forged in a recombination hot spot. But that greater vocal range was a
necessary precondition for the later evolution of spoken language. In another
example, perhaps a single biased cluster originally triggered enlargement of our
brains; and this change was fixed through BGC pressure, even though it resulted in
more difficult childbirth. The advantages of increased symbolic and temporal thought
may have only arrived after additional changes built upon the foundation of a larger
brain. If the model of “BGC selection” is correct, then studying biased clustered
substitutions may prove to be a very useful tool in distinguishing favorable vs.
tolerated changes to a genome. While both types of changes have made us human,
recognizing the positively selected changes may tell us more about why we became
human. And in one more way, the footsteps of past recombination may allow us to
better understand the sequence of events that have led us to where we are today.
5.0 Conclusion
The model revealed by studying biased clusters of substitutions is captivating. Even
at some distance it can be seen that the force that results in weak to strong
substitutions is counterbalancing the general background bias of strong to weak
mutations. Looking more closely, the force acts predominantly near telomeres and
this may explain isochore creation and why many isochores appear to be losing
ground today. Circling to a different perspective, it may be possible to discern and
date chromosomal rearrangements by measuring the accumulation of weak to strong
bias. Take another step closer, and the motive force creating biased clusters of
substitutions appears to be found in the reproductive strategy of males, and may
ultimately provide insight into the mystery of sexual reproduction. Finally, when we
lean close and look at the very chisel marks of biased clustered substitutions, the
force that chips them into the genome may be recognized as a non-Darwinian
“selection pressure” fueling neutral evolution which has surely helped sculpt humans,
chimpanzees and no doubt a vast number of other species.
Bibliography
1 The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 437:69-87. September 1, 2005. http://www.nature.com/nature/journal/v437/n7055/full/nature04072.html.
2 Pollard K.S., Salama S.R., King B., Kern A., Dreszer T., Katzman S., Siepel A., Pedersen J., Bejerano G., Baertsch R., Rosenbloom K.R., Kent J. and Haussler D. Forces Shaping the Fastest Evolving Regions in the Human Genome. PLoS Genetics. 2(10):e168 Oct. 13, 2006. http://genetics.plosjournals.org/perlserv/?request=get-document&doi=10.1371%2Fjournal.pgen.0020168
3 Pollard K.S., Salama S.R., Lambert N., Lambot M.A., Coppens S., Pedersen J.S., Katzman S., King B., Onodera C., Siepel A., Kern A.D., Dehay C., Igel H., Ares, Jr M., Vanderhaeghen P. and Haussler D. An RNA gene expressed during cortical development evolved rapidly in humans. Nature advance online publication 16 August 2006. http://www.nature.com/nature/journal/vaop/ncurrent/abs/nature05113.html.
4 Hisaji Maki. ORIGINS OF SPONTANEOUS MUTATIONS: Specificity and Directionality of Base-Substitution, Frameshift, and Sequence-Substitution Mutageneses. Annu. Rev. Genet. 36:279–303. 2002. http://arjournals.annualreviews.org/doi/pdf/10.1146/annurev.genet.36.042602.094806?cookieSet=1
5 Eyre-Walker A. Evidence of Selection on Silent Site Base Composition in Mammals: Potential Implications for the Evolution of Isochores and Junk DNA. Genetics, 152:675-683, June 1999. http://www.genetics.org/cgi/content/abstract/152/2/675.
6 Lipatov M., Arndt P.F., Hwa T., Petrov D.A. A Novel Method Distinguishes Between Mutation Rates and Fixation Biases in Patterns of Single-Nucleotide Substitution. J Mol Evol. 62:168–175. 2006. http://matisse.ucsd.edu/~hwa/pub/lipatov06.pdf.
7 Filipski, J., Thiery, J. P. & Bernardi, G. An analysis of the bovine genome by Cs2SO4-Ag+ density centrifugation. J. Mol. Biol. 80:177–197 1973. http://content.febsjournal.org/cgi/content/abstract/84/1/179.
8 Bernardi G. The human genome: organization and evolutionary history. Annu Rev Genet. 29:445-76. 1995. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=8825483&dopt=Abstract
9 Bernardi G Isochores and the evolutionary genomics of vertebrates. Gene 241(1):3-17. January 4, 2000. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10607893&dopt=Citation.
10 Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F. The mosaic genome of warm-blooded vertebrates. Science. 228(4702):953-8. May 24, 1985. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=4001930&dopt=Abstract.
11 Press W.H. and Robins H. Isochores as Extreme Cases of Genes Cooperatively Reshaping the Large-scale Genomic Environment. preprint submitted (submitted November 2005). http://www.lanl.gov/DLDSTP/biopreprint/draft2.pdf.
12 Eyre-Walker A. and Hurst L.D. The evolution of isochores. Nat Rev Genet. 2(7):549-55. July 2001. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11433361&dopt=Abstract.
13 Wolfe K.H., Sharp P.M. & Li W. Mutation rates differ among regions of the mammalian genome. Nature 337:283-285 January 19, 1989. http://www.nature.com/nature/journal/v337/n6204/abs/337283a0.html.
14 Fryxell K,J, and Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol. Biol. And Evol. 17(9):1371-1383. Sept. 2000. http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WVB-45DHT6H-26W&_user=4428&_coverDate=01%2F01%2F1900&_fmt=summary&_orig=search&_qd=1&_cdi=7098&view=c&_acct=C000059601&_version=1&_urlVersion=0&_userid=4428&md5=04e42a1dd31fcfbff7a21c0a98e.
15 Hurst LD, Merchant AR. High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc Biol Sci. 268(1466):493-7. Mar 7 2001. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11296861&dopt=Abstract
16 Lercher M.J., Urrutia A.O., Pavlícek A. and Hurst L.D. A unification of mosaic structures in the human genome. Hum. Mol. Genetics, 12(19):2411-2415. 2003. http://hmg.oxfordjournals.org/cgi/content/abstract/12/19/2411.
17 Sémon M., Mouchiroud D. and Duret L. Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance. Hum. Mol. Genetics 14(3):421-427 2005. http://hmg.oxfordjournals.org/cgi/content/abstract/14/3/421.
18 Kudla G., Lipinski L., Caffin F., Helwak A. and Zylicz M. High Guanine and Cytosine Content Increases mRNA Levels in Mammalian Cells. PLoS Biol. 4(6):e180. May 23, 2006. http://biology.plosjournals.org/perlserv?request=get-document&doi=10.1371/journal.pbio.0040180.
19 Eyre-Walker A. Recombination and mammalian genome evolution. Proc Biol Sci. June 22 1993. 252(1335):237-43. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8394585&dopt=Abstract.
20 Brown, T. C., and J. Jiricny. Different base/base mispairs are corrected with different efficiencies and specificities in monkey kidney cells. Cell 54:705–711. 1988. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pubmed.
21 Webster MT, Smith NG. Fixation biases affecting human SNPs. Trends Genet. 20(3):122-6. Mar. 2004. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15049304&query_hl=20.
22 Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across the human genome. Science. 310(5746):321-4. Oct. 14 2005. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?itool=abstractplus&db=pubmed&cmd=Retrieve&dopt=abstractplus&list_uids=16224025.
23 Winckler W., Myers S.R., Richter D.J., Onofrio R.C., McDonald G.J., Bontrop R.E., McVean G.A.T., Gabriel S.B., Reich D., Donnelly P., Altshuler D. Comparison of Fine-Scale Recombination Rates in Humans and Chimpanzees. Science 308(5718):107-111. April 1, 2005. http://www.sciencemag.org/cgi/content/full/308/5718/107.
24 Pineda-Krch M. and Redfield R.J. Persistence and Loss of Meiotic Recombination Hotspots. Genetics, 169:2319-2333, April 2005. http://www.genetics.org/cgi/content/full/169/4/2319.
25 Meunier, J., and L. Duret. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 21:984–990. 2004. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=14963104&query_hl=6&itool=pubmed_docsum.
26 Birdsell J.A. Integrating Genomics, Bioinformatics, and Classical Genetics to Study the Effects of Recombination on Genome Evolution. Molecular Biology and Evolution 19:1181-1197. 2002. http://mbe.oxfordjournals.org/cgi/content/full/19/7/1181.
27 Nagylaki T. Evolution of a finite population under gene conversion. Proc. Natl. Acad. Sci. USA. 80(20):6278–6281. October 1983. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=394279&tools=bot.
28 Jeffreys A.J. and Neumann R. Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nature Genetics 31:267–271. 2002. http://www.nature.com/ng/journal/v31/n3/full/ng910.html.
29 Kent, W.J., Sugnet, C. W., Furey, T. S., Roskin, K.M., Pringle, T. H., Zahler, A. M., and Haussler, D. The human genome browser at UCSC. Genome Res. 12(6), 996-1006. 2002. http://www.genome.org/cgi/content/abstract/12/6/996.
30 Karolchik, D. and Kent, W.J. The UCSC Genome Browser. Current Protocols in Bioinformatics (ed. Baxevanis, A.D.) (John Wiley & Sons, Inc., 2002).
31 Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., Weber, R.J., Haussler, D. and Kent, W.J. The UCSC Genome Browser Database. Nucl. Acids Res. 31(1):51-54. 2003. http://nar.oxfordjournals.org/cgi/content/abstract/31/1/51.
32 The R Foundation for Statistical Computing. 2006. http://www.r-project.org/index.html.
33 Macaque Genome Sequencing Consortium led by the Baylor College of Medicine Human Genome Sequencing Center, in collaboration with the J. Craig Venter Institute Joint Technology Center, and the Genome Sequencing Center at Washington University School of Medicine, St. Louis. January 2005. http://www.hgsc.bcm.tmc.edu/projects/rmacaque/.
34 Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420:520-562. 2002. http://www.nature.com/nature/mousegenome/.
35 Rat Genome Sequencing Project Consortium. Genome Sequence of the Brown Norway Rat Yields Insights into Mammalian Evolution. Nature, 428:493–521, 2004. http://www.nature.com/nature/journal/v428/n6982/abs/nature02426_fs.html.
36 The International HapMap Consortium. A haplotype map of the human genome. Nature 437:1299-1320. 2005. http://www.hapmap.org/downloads/presentations/Nature_HapMap_phaseI.pdf.
37 Kong, A., Gudbjartsson, D.F., Sainz, J., Jonsdottir, G.M., Gudjonsson, S.A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., Palsson, S.T., Frigge, M.L., Thorgeirsson, T.E., Gulcher, J.R., and Stefansson, K. A high-resolution recombination map of the human genome, Nature Genetics, 31(3):241-247. 2002. http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v31/n3/abs/ng917.html.
38 Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS UniProt: the Universal Protein Knowledgebase. Nucleic Acids Res. 32:D115-D119. 2004. http://www.expasy.uniprot.org/about/publications.shtml.
39 The NCBI handbook [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; Oct. 2002. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books.
40 Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J. and Wheeler D.L. GenBank. Nucleic Acids Research, Vol. 33, Database issue Oxford University Press 2005. http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D34.
41 Benson D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. GenBank: update. Nucleic Acids Res. 32, D23-6. 2004. http://nar.oxfordjournals.org/cgi/content/full/32/suppl_1/D23.
42 Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20):11484-11489. 2003. http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=77835828&c=chr2&g=chainSelf.
43 Smit A.F.A., Hubley R. and Green P. RepeatMasker Open-3.0. 1996-2004 http://www.repeatmasker.org/.
44 Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R. K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15:1034-1050. 2005. http://www.genome.org/cgi/content/full/15/8/1034.
45 Bussell J.J., Pearson N.M., Kanda R., Filatov D.A. and Lahn B.T. Human polymorphism and human–chimpanzee divergence in pseudoautosomal region correlate with local recombination rate. Gene. 368:94-100. March 1, 2006. http://www.sciencedirect.com/science/article/B6T39-4HSY4TN-2/2/d4fa698542e4dbc8268fe00b7c71d882.
46 Strathern J.N.; Shafer B.K.; McGill C.B. DNA synthesis errors associated with double-strand-break repair. Genetics. 140(3):965-972. 1995. http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6T1G-3P16SDR-2J5&_user=4428&_coverDate=01%2F01%2F1900&_fmt=summary&_orig=search&_qd=1&_cdi=4890&view=c&_acct=C000059601&_version=1&_urlVersion=0&_userid=4428&md5=08c335a83b53abae471c11f2c19.
47 Ijdo, J., Baldini, A., Ward, D.C., Reeders, S.T., and Wells, R.A. Origin of human chromosome 2: An ancestral telomere–telomere fusion. Proc. Natl. Acad. Sci. 88: 9051–9055. October 1991. http://www.pnas.org/cgi/content/abstract/88/20/9051.
48 Navarro A. and Barton N.H.. Chromosomal Speciation and Molecular Divergence--Accelerated Evolution in Rearranged Chromosomes. Science 300(5617):321-324. April 11, 2003. http://www.sciencemag.org/cgi/content/full/300/5617/321?ijkey=96440a7ada6aaf7f0e287d198953db7c1a2575a0.
49 Yu A., Zhao C., Fan Y., Jang W, Mungall A.J., Deloukas P., Olsen A., Doggett A., Ghebranious N., Broman K.W. and Weber J.L. Comparison of human genetic and sequence-based physical maps. Nature 409:951-953 February 15, 2001. http://www.nature.com/nature/journal/v409/n6822/full/409951a0.html.
50 Webster M.T., Smith N.G.C., Hultin-Rosenberg L., Arndt P.F. and Ellegren H. Male-Driven Biased Gene Conversion Governs the Evolution of Base Composition in Human Alu Repeats. Mol. Biol. and Evol. 22(6):1468-1474. 2005. http://mbe.oxfordjournals.org/cgi/content/abstract/22/6/1468.
51 Burge, C. Modeling Dependencies in Pre-mRNA Splicing Signals. In Salzberg, S., Searls, D., and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, 127-163. 1998. GenScan tool: http://genes.mit.edu/GENSCAN.html.
52 Webb J.C., Patel D.D., Jones M.D., Knight B.L., Soutar A.K. Characterization and tissue-specific expression of the human 'very low density lipoprotein (VLDL) receptor' mRNA. Hum. Mol. Genet. 3:531-537. 1994. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8069294&dopt=Abstract.
53 Chen T., Ajami K., McCaughan G.W., Gorrell M.D., Abbott C.A. Dipeptidyl peptidase IV gene family. The DPIV family. Adv. Exp. Med. Biol. 524:79-86(2003). http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12675227&dopt=Abstract.
54 Wei M.-H., Karavanova I., Ivanov S.V., Popescu N.C., Keck C.L., Pack S., Eisen J.A., Lerman M.I. In silico-initiated cloning and molecular characterization of a novel human member of the L1 gene family of neural cell adhesion molecules. Hum. Genet. 103:355-364. 1998. http://www.springerlink.com/content/10h0ucut05mm3acp/.
55 Matthes H, Boschert U, Amlaiky N, Grailhe R, Plassat JL, Muscatelli F, Mattei MG, Hen R. Mouse 5-hydroxytryptamine5A and 5-hydroxytryptamine5B receptors define a new family of serotonin receptors: cloning, functional expression, and chromosomal localization. Mol Pharmacol. 43(3):313-9. March 1993. http://molpharm.aspetjournals.org/cgi/reprint/43/3/313
56 Grailhe R., Grabtree G.W. and Hen R. Human 5-HT5 receptors: the 5-HT5A receptor is functional but the 5-HT5B receptor was lost during mammalian evolution. Eur J Pharmacol. 418(3):157-67. Apr. 27, 2001. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=Abstract&list_uids=11343685&query_hl=2&itool=pubmed_docsum
57 Birney E., Clamp M. and Durbin R. GeneWise and Genomewise. Genome Research 14:988-995. 2004. Tool: http://www.ebi.ac.uk/Wise2/index.html http://www.genome.org/cgi/content/abstract/14/5/988.
58 Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Yamashita RA, Yin JJ, Bryant SH, CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 33(Database Issue):D192-6. 2005. Tool:http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D192.
59 Gomez-Jeria JS, Morales-Lagos DR. Quantum chemical approach to the relationship between molecular structure and serotonin receptor binding affinity. J. Pharm Sci. 73(12):1725-8. Dec. 1984. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=6527244&dopt=Abstract
60 Fan Y., Linardopoulou E., Friedman C., Williams E. and Trask B.J. Genomic Structure and Evolution of the Ancestral Chromosome Fusion Site in 2q13–2q14.1 and Paralogous Regions on Other Human Chromosomes. Genome Res. 12(11):1651-1662. Nov. 2002. http://www.genome.org/cgi/content/abstract/12/11/1651
61 Kim E, Cho KO, Rothschild A, Sheng M. Heteromultimerization and NMDA receptor-clustering activity of Chapsyn-110, a member of the PSD-95 family of proteins. Neuron. 17(1):103-13. July 1996. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8755482&dopt=Abstract.
62 Kim E., Cho K.O., Rothschild A. and Sheng M. Heteromultimerization and NMDA receptor-clustering activity of Chapsyn-110, a member of the PSD-95 family of proteins. Neuron. 17(1):103-13. July 1996. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8755482&dopt=Abstract.
63 Player A., Gillespie J., Fujii T., Fukuoka J., Dracheva T., Meerzaman D., Hong K.M., Curran J., Attoh G., Travis W. and Jen J. Identification of TDE2 gene and its expression in non-small cell lung cancer. Int. J. Cancer. 107(2):238-43. Nov. 1, 2003.
64 Pruitt K.D., Tatusova, T., Maglott D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33(1):D501-D504 2005. http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D501?ijkey=06NkMzN3kUcez&keytype=ref.
65 Zhang M., Yu L., Wu Q., Zheng L.H., Wei Y.H., Wan B., Zhao S.Y. Identification and characterization of TDE2, a plasma-membrane protein with 11 transmembrane helices, and its variable expression in human lung cancer and liver cancer tissues. Submitted (JUL-2003) to the EMBL/GenBank/DDBJ databases.
66 Scholnick S.B., Richter T.M. The role of CSMD1 in head and neck carcinogenesis. Genes Chromosomes Cancer 38(3):281-283. 2003.
67 Belle E.M.S., Duret L., Galtier N. and Eyre-Walker A. The Decline of Isochores in Mammals: An Assessment of the GC Content Variation Along the Mammalian Phylogeny. J Mol Evol. 58(6):653-60. June 2004. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=15461422&dopt=Citation
68 Kumar S., Filipski A., Swarna V., Walker A., and Hedges S.B. Placing confidence limits on the molecular age of the human–chimpanzee divergence. Proc. Natl. Acad. Sci. USA. 102(52):18842-7. Dec 27, 2005. http://www.pnas.org/cgi/content/full/102/52/18842.
69 McVean G.T. & Hurst L.D. Evidence for a selectively favourable reduction in the mutation rate of the X chromosome. Nature 386:388-392 March 17, 1997. http://www.nature.com/nature/journal/v386/n6623/abs/386388a0.html;jsessionid=D8FE9B08A8DD0129F0257A7AB3C59055.
70 Goetting-Minesky MP, Makova KD. Mammalian Male Mutation Bias: Impacts of Generation Time and Regional Variation in Substitution Rates. J Mol Evol. [Epub ahead of print] 2006 Sep 4. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16955237.
71 Graves J.A.M, Wakefield M.J. and Toder R. The Origin and Evolution of the Pseudoautosomal Regions of Human Sex Chromosomes. Hum. Mol. Genet. 7(13):1991-6. Dec. 1998. http://hmg.oxfordjournals.org/cgi/content/full/7/13/1991.
72 Aitken, R.J. and Graves, J.A.M. Human spermatozoa: The future of sex. Nature 415:963. February 28, 2002. http://www.nature.com/nature/journal/v415/n6875/full/415963a.html.
73 Hughes J.F., Skaletsky H., Pyntikova T., Minx P.J., Graves T., Rozen S., Wilson R.K. and Page D. C. Conservation of Y-linked genes during human evolution revealed by comparative sequencing in chimpanzee. Nature 437:100-103 September 1, 2005. http://www.nature.com/nature/journal/v437/n7055/full/nature04101.html.
74 Kuroki Y., Toyoda A., Noguchi H., Taylor T.D., Itoh T., Kim D., Kim D., Choi S., Kim I., Choi H.H., Kim Y.S., Satta Y., Saitou N., Yamada T., Morishita S., Hattori M., Sakaki Y., Park H. and Fujiyama A. Comparative analysis of chimpanzee and human Y chromosomes unveils complex evolutionary pathway. Nature Genetics. 38:158-167 2006. http://www.nature.com/ng/journal/v38/n2/full/ng1729.html.
75 Alberts B., Bray D., Lewis J., Raff M., Roberts K. and Watson J.D. Molecular Biology of the Cell. Garland Publishing, inc. New York and London. 1983. p106.
76 Li WH, Yi S, Makova K. Male-driven evolution. Curr. Opin. Genet. Dev. 12(6):650-6. Dec. 2002.
77 Crow JF. How much do we know about spontaneous human mutation rates? Environ. Mol. Mutagen. 21(4):389. 1993. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=8444142&dopt=Abstract.
78 Trivers R. Social Evolution. Benjamin/Cummings Publishing Company, Inc. Menlo Park, California. 1985.
79 Lercher M.J., Williams E.J., Hurst L.D. Local similarity in evolutionary rates extends over whole chromosomes in human-rodent and mouse-rat comparisons: implications for understanding the mechanistic basis of the male mutation bias. Mol Biol Evol. 18(11):2032-9. Nov. 2001. http://mbe.oxfordjournals.org/cgi/content/full/18/11/2032.
80 Filatov D.A., Charlesworth D. Substitution rates in the X- and Y-linked genes of the plants, Silene latifolia and S. dioica. Mol. Biol. Evol. 19(6):898-907. June 2002. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12032246.
81 Gaffney D.J., Keightley P.D. The scale of mutational variation in the murid genome. Genome Res. 15(8):1086-94. Epub 2005 Jul 15. Aug. 2005. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16024822.
82 Møller A.P. and Cuervo J.J. Sexual selection, germline mutation rate and sperm competition. BMC Evol. Biol. 3:6. April 2003. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=156621.
83 Holt W.V., Van Look K.J. Concepts in sperm heterogeneity, sperm selection and sperm competition as biological foundations for laboratory tests of semen quality. Reproduction. 127(5):527-35. May 2004. http://www.reproduction-online.org/cgi/content/full/127/5/527.
84 Prado F., Piruat J.I. and Aguilera A. Recombination between DNA repeats in yeast hpr1Delta cells is linked to transcription elongation. The EMBO Journal 16:2826–2835. 1997. http://www.nature.com/emboj/journal/v16/n10/abs/7590265a.html.
85 Aguilera A., The connection between transcription and genomic instability. EMBO J. 21:195–201 (2002). http://www.nature.com/emboj/journal/v21/n3/full/7594240a.html.
86 Mieczkowski P.A., Dominska M., Buck M.J., Gerton J.L., Lieb J.D. and Petes T.D. Global analysis of the relationship between the binding of the Bas1p transcription factor and meiosis-specific double-strand DNA breaks in Saccharomyces cerevisiae. Mol. Cell Biol. 26(3):1014-27. Feb. 2006. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=16428454&query_hl=26&itool=pubmed_docsum.
87 Vinogradov A.E. Isochores and tissue-specificity. Nucleic Acids Res. 31(17):5212–5220. Sept. 1, 2003. http://www.pubmedcentral.gov/articlerender.fcgi?tool=pubmed&pubmedid=12930973.
88 Takahata N. Neutral theory of molecular evolution. Curr. Opin. Genet. Dev. 6(6):767-72. Dec. 1996. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=8994850&query_hl=30&itool=pubmed_DocSum.