A Bayesian approach to inferring recent selective sweeps in West African Anopheles gambiae populations
John Marshall¹, Professor Robert Weiss²
¹Department of Biomathematics, UCLA School of Medicine, Los Angeles CA 90095-1766 USA
²Department of Biostatistics, UCLA School of Public Health, Los Angeles CA 90095-1772 USA
Using microsatellite alleles to detect recent selective sweeps
Microsatellites:
– Tandem repeats of short DNA segments typically 1-5 bp in length
– Alleles defined by number of repeats at a particular locus
– Multiallelic → highly informative markers
Factors affecting variance in microsatellite allele size:
1. Locus specific:
• Microsatellite mutation rate (mainly due to ‘slippage’ during DNA replication)
2. Population specific:
• Effective population size
• Population-level events (migration, bottlenecks)
3. Population and locus specific:
• Hitchhiking of a microsatellite allele to a selected gene
The lnRV statistic
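$$\ln RV_{i_1 i_2 j} = \ln\frac{\sigma^2_{i_1 j}}{\sigma^2_{i_2 j}} = \ln\sigma^2_{i_1 j} - \ln\sigma^2_{i_2 j}$$

$$E[\sigma^2_{ij}] = 4 N_{e_i} \mu_j$$

$$\ln\frac{E[\sigma^2_{i_1 j}]}{E[\sigma^2_{i_2 j}]} = \ln\frac{4 N_{e_{i_1}} \mu_j}{4 N_{e_{i_2}} \mu_j} = \ln N_{e_{i_1}} - \ln N_{e_{i_2}}$$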
• From population genetics, the variance in microsatellite allele size at a given locus (j) in a given population (i) is a function of effective population size ($N_{e_i}$) and microsatellite mutation rate ($\mu_j$)
• Taking the ratio of expected variances in microsatellite allele sizes for a pair of populations ($i_1$ and $i_2$) thus removes the locus-dependence
• For a pair of populations ($i_1$ and $i_2$) the ratio of variances can be calculated for each of a set of loci (j = 1, 2, …, T)
• Using coalescent simulations, the lnRV values have been shown empirically to follow a normal distribution
• A microsatellite near a selected locus is expected to have reduced variance and hence an lnRV value that is an outlier from the otherwise normal distribution of lnRV values
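The calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names `ln_rv` and `standardize`, the 1.96 cutoff, and the toy data are all assumptions made for the example, and allele sizes are drawn as continuous values purely for convenience.

```python
import numpy as np

def ln_rv(sizes_pop1, sizes_pop2):
    """lnRV per locus: log ratio of allele-size variances for two populations.

    Each argument is a list holding one 1-D array of allele sizes per locus;
    loci must appear in the same order in both lists.
    """
    v1 = np.array([np.var(a, ddof=1) for a in sizes_pop1])
    v2 = np.array([np.var(a, ddof=1) for a in sizes_pop2])
    return np.log(v1) - np.log(v2)

def standardize(values):
    """Z-scores of the lnRV values across loci."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Toy data (made up): 20 loci, 40 allele copies per population per locus.
rng = np.random.default_rng(0)
n_loci, n_alleles = 20, 40
# Locus-specific spread stands in for the locus-specific mutation rate;
# it multiplies both populations and so cancels in the ratio.
locus_sd = rng.uniform(1.0, 4.0, size=n_loci)
pop1 = [rng.normal(20.0, s, n_alleles) for s in locus_sd]
pop2 = [rng.normal(20.0, 1.3 * s, n_alleles) for s in locus_sd]
# Hypothetical swept locus: variance strongly reduced in population 1 only.
pop1[5] = rng.normal(20.0, 0.3 * locus_sd[5], n_alleles)

z = standardize(ln_rv(pop1, pop2))
candidates = np.flatnonzero(np.abs(z) > 1.96)  # loci flagged as outliers
```

Because the locus-specific factor enters both variances, it drops out of the log ratio, which is the property the lnRV statistic relies on.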
Pros and cons of the lnRV statistic
PROS:
– Easy and fast to calculate
– Intuitive to understand
– Can cope with a very large number of loci
– Not sensitive to genetic drift, migration or inbreeding, since these processes affect all loci to the same extent and so are removed in the ratio calculation
CONS:
– Much information is lost when a set of allele size data at a particular locus for all individuals in a population is reduced to a single value
– Only makes pair-wise comparisons
– Difficult to extrapolate methodology to >2 populations
– Inferences from pairs of populations are not carried over to other populations
– Masking can occur when multiple outliers expand the confidence interval and lead to none or only a subset of outliers being detected
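The masking effect can be seen in a small numerical sketch. The lnRV values below are invented for illustration, and the helper `flag_outliers` with its 1.96 cutoff is an assumption, not part of the original method description.

```python
import numpy as np

def flag_outliers(ln_rv_values, z_cut=1.96):
    """Indices of loci whose standardized lnRV exceeds the cutoff."""
    values = np.asarray(ln_rv_values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.flatnonzero(np.abs(z) > z_cut)

# 19 unselected loci with lnRV values spread evenly between -0.9 and 0.9.
background = np.linspace(-0.9, 0.9, 19)

# One strongly selected locus: it stands out clearly.
one_sweep = np.append(background, -3.0)
flagged_one = flag_outliers(one_sweep)      # only index 19 is flagged

# Five equally selected loci: they inflate the standard deviation and
# shift the mean, so none of them crosses the cutoff any more.
five_sweeps = np.append(background, [-3.0] * 5)
flagged_five = flag_outliers(five_sweeps)   # empty: the sweeps mask each other
```

With one sweep the outlier has a z-score of about -3.3; with five sweeps each has a z-score of about -1.8, which falls inside the ±1.96 interval.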
The Bayesian model
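Distribution of microsatellite allele sizes (i indexes population, j indexes locus, k indexes individual):

$$y_{ijk} \sim N(\mu_{ij}, \sigma^2_{ij})$$

Mean components (superscript $\mu$):

$$\mu_{ij} = \mu_0 + \alpha^{\mu}_i + \beta^{\mu}_j + \gamma^{\mu}_{ij}$$
$$\alpha^{\mu}_i \sim N(0, \tau^{\mu}_{\alpha}), \quad \beta^{\mu}_j \sim N(0, \tau^{\mu}_{\beta}), \quad \gamma^{\mu}_{ij} \sim N(0, \tau^{\mu}_{\gamma})$$
$$\tau^{\mu}_{\alpha} \sim \mathrm{Gamma}(a, b), \quad \tau^{\mu}_{\beta} \sim \mathrm{Gamma}(a, b), \quad \tau^{\mu}_{\gamma} \sim \mathrm{Gamma}(a, b)$$

Variance components:

$$\ln\sigma^2_{ij} = \eta_0 + \alpha_i + \beta_j + \gamma_{ij}$$
$$\alpha_i \sim N(0, \tau_{\alpha}), \quad \beta_j \sim N(0, \tau_{\beta}), \quad \gamma_{ij} \sim N(0, \tau_{\gamma})$$
$$\tau_{\alpha} \sim \mathrm{Gamma}(a, b), \quad \tau_{\beta} \sim \mathrm{Gamma}(a, b), \quad \tau_{\gamma} \sim \mathrm{Gamma}(a, b)$$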
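To make the model structure concrete, the following sketch simulates allele sizes from a two-way layout in which both the mean and the log variance are sums of a population effect, a locus effect and a population-by-locus interaction. It is a forward simulation only, not a fitting routine. The function name `simulate_allele_sizes`, the intercepts, and the fixed effect standard deviations are assumptions for illustration; in the model itself the spread of each effect has its own Gamma prior instead of a fixed value.

```python
import numpy as np

def simulate_allele_sizes(n_pops, n_loci, n_ind, rng,
                          mu0=20.0, eta0=1.0,
                          sd_mean=(1.0, 3.0, 0.5),
                          sd_logvar=(0.3, 0.5, 0.2)):
    """Draw y[i, j, k] ~ N(mu[i, j], sigma2[i, j]) with additive effects.

    Effect standard deviations are fixed here (an assumption made for
    the sketch); they are not estimated from data.
    """
    def additive_effects(sds):
        pop = rng.normal(0.0, sds[0], size=(n_pops, 1))          # population effect
        locus = rng.normal(0.0, sds[1], size=(1, n_loci))        # locus effect
        inter = rng.normal(0.0, sds[2], size=(n_pops, n_loci))   # interaction
        return pop + locus + inter

    mu = mu0 + additive_effects(sd_mean)           # mean components
    log_var = eta0 + additive_effects(sd_logvar)   # variance components (log scale)
    sigma = np.exp(0.5 * log_var)
    y = rng.normal(mu[:, :, None], sigma[:, :, None],
                   size=(n_pops, n_loci, n_ind))
    return y, mu, log_var

rng = np.random.default_rng(42)
# 5 populations and 21 loci; 30 individuals per cell is an arbitrary choice.
y, mu, log_var = simulate_allele_sizes(n_pops=5, n_loci=21, n_ind=30, rng=rng)
sample_log_var = np.log(y.var(axis=2, ddof=1))  # noisy estimate of log_var
```

Simulated data of this kind can be used to check that a fitting procedure recovers known interaction terms before it is applied to real allele sizes.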
Consistency between lnRV statistic and Bayesian ANOVA
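Bayesian ANOVA:

$$\ln\sigma^2_{ij} = \eta_0 + \alpha_i + \beta_j + \gamma_{ij}$$
$$\alpha_i \sim N(0, \tau_{\alpha}), \quad \beta_j \sim N(0, \tau_{\beta}), \quad \gamma_{ij} \sim N(0, \tau_{\gamma})$$
$$\ln RV_{i_1 i_2 j} = (\alpha_{i_1} - \alpha_{i_2}) + (\gamma_{i_1 j} - \gamma_{i_2 j})$$

lnRV statistic:

$$E[\ln(\sigma^2_{ij})] = \ln 4 + \ln N_{e_i} + \ln\mu_j + \ln s_{ij}$$

with $\ln N_{e_i}$, $\ln\mu_j$ and $\ln s_{ij}$ each normally distributed, where $s_{ij}$ denotes the fractional reduction in allele size variance at locus j in population i.

$$\ln RV_{i_1 i_2 j} = (\ln N_{e_{i_1}} - \ln N_{e_{i_2}}) + \ln\frac{s_{i_1 j}}{s_{i_2 j}}$$

Relative selection:

$$\gamma_{i_1 j} - \gamma_{i_2 j} \;\leftrightarrow\; \ln\frac{s_{i_1 j}}{s_{i_2 j}}$$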
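The cancellation behind this correspondence can be checked numerically. The sketch below assumes the log variance is additive in an intercept, a population effect, a locus effect and a population-by-locus interaction; the variable names and the random values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pops, n_loci = 3, 6

eta0 = 0.7                                   # intercept
alpha = rng.normal(size=n_pops)              # population effects
beta = rng.normal(size=n_loci)               # locus effects
gamma = rng.normal(size=(n_pops, n_loci))    # population-by-locus interactions

# Log variance for every population/locus cell.
log_var = eta0 + alpha[:, None] + beta[None, :] + gamma

# lnRV for populations i1 and i2 at every locus.
i1, i2 = 0, 2
ln_rv = log_var[i1] - log_var[i2]

# The intercept and the locus effects cancel in the difference.
expected = (alpha[i1] - alpha[i2]) + (gamma[i1] - gamma[i2])
matches = np.allclose(ln_rv, expected)

# Changing the locus effects leaves lnRV untouched.
log_var_shifted = eta0 + alpha[:, None] + (beta + 5.0)[None, :] + gamma
unchanged = np.allclose(log_var_shifted[i1] - log_var_shifted[i2], ln_rv)
```

Both `matches` and `unchanged` are true for any choice of effects, since the result follows from the algebra and not from the particular random draws.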
Bayesian statistics for detecting selective sweeps
For a given locus j, the population with the smallest fractional reduction in allele size variance is denoted $i_{max}$ and has the corresponding variance component $\gamma_{i_{max} j}$, the population-by-locus term of the log variance.
Relative selection at locus j can be measured relative to population $i_{max}$, e.g.:
• Here BnM has the largest value so is least selected
• BnB and SeB have the smallest values so are most selected
• The extent of selection can be measured by:
$$\Pr(\gamma_{ij} < \gamma_{i_{max} j} \mid y)$$
• And:
$$E(\gamma_{ij} - \gamma_{i_{max} j} \mid y)$$
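Given posterior draws of the population-by-locus variance components, summaries of this kind reduce to simple averages over the draws. The sketch below is one possible way to compute them, not the authors' implementation: the function name `selection_summaries`, the array layout, the choice of the reference population as the one with the largest posterior mean, and the synthetic draws are all assumptions.

```python
import numpy as np

def selection_summaries(gamma_draws):
    """Summaries of selection relative to a reference population per locus.

    gamma_draws has shape (n_draws, n_pops, n_loci) and holds posterior
    draws of the population-by-locus variance components.
    """
    post_mean = gamma_draws.mean(axis=0)            # (n_pops, n_loci)
    i_max = post_mean.argmax(axis=0)                # reference population per locus
    loci = np.arange(gamma_draws.shape[2])
    ref = gamma_draws[:, i_max, loci]               # (n_draws, n_loci)
    diff = gamma_draws - ref[:, None, :]            # component minus reference, per draw
    prob_lower = (diff < 0).mean(axis=0)            # posterior probability of a lower value
    mean_diff = diff.mean(axis=0)                   # posterior mean difference
    return i_max, prob_lower, mean_diff

# Synthetic draws standing in for MCMC output (made up for illustration).
rng = np.random.default_rng(7)
draws = rng.normal(0.0, 0.3, size=(2000, 5, 21))
draws[:, 2, 7] -= 2.0   # hypothetical sweep: population 2, locus 7

i_max, prob_lower, mean_diff = selection_summaries(draws)
```

A probability near 1 together with a strongly negative mean difference for a given population and locus would then indicate reduced variance relative to the least selected population at that locus.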
Pros and cons of Bayesian approach
PROS:
– Doesn’t shrink data down to summary statistics before analysis
– Can be used to compare >2 populations at once
– Inferences from one population are carried over to all others
– Can cope with any number of selected loci without masking occurring
– Supplies quantitative measures of the probability that selection has occurred
– Can cope well with very small sample sizes
CONS:
– Can take a long time to converge
– Sometimes requires a lot of computing power
– Bayesian methods are more difficult to implement
  • Require well-specified prior distributions
  • Require programming and the use of complicated software
– Inferences are influenced to some degree by the subjective choice of prior distributions
Microsatellite data for West African Anopheles gambiae populations
• 1998 data set:
– Allele size data collected at 21 microsatellite loci dispersed throughout the Anopheles gambiae genome
– 5 subpopulations:
  • Bamako chromosomal form in the villages of Banambani and Selinkenyi
  • Mopti chromosomal form in the villages of Banambani and Selinkenyi
  • Savannah chromosomal form in the village of Banambani
• 2003 data set:
– Microsatellite allele size data collected at 12 microsatellite loci dispersed throughout Anopheles gambiae chromosome 3
– Data taken for 12 subpopulations:
  • Mopti chromosomal form in the villages of Oure, Dire, Kondi, Nampala, Torkya and Banikane
  • Savannah chromosomal form in the villages of Oure, Gono, Kokouna, Pimperena, Soulouba and Madina Diasra
Loci likely targeted by recent selective sweeps (1998 data set)
Applying the Bayesian ANOVA model to the 1998 data set, there is evidence of selection (in order of magnitude) in:

Rank  Locus  Chromosome  Chromosomal form  Location
1     637    2L          Bamako            Banambani & Selinkenyi
2     038    X           Savannah          Banambani
3     135    2           Mopti             Banambani & Selinkenyi
4     079    2           Mopti             Banambani & Selinkenyi
5     175    2           Mopti             Banambani & Selinkenyi
6     095    2R          Mopti             Banambani & Selinkenyi
7     025    X           Savannah          Banambani

[Figure: Locus 637]
Loci likely targeted by recent selective sweeps (2003 data set)
Applying the Bayesian ANOVA model to the 2003 data set, there is evidence of selection (in order of magnitude) in:

Rank  Locus  Chromosome  Chromosomal form  Location
1     119    3R          Mopti             Oure
2     127    3R          Savannah          Oure
3     093    3L          Mopti             Kondi & Banikane
                         Savannah          Gono & Kokouna
4     812    3           Mopti             Nampala & Dire
5     817    3L          Savannah          Soulouba
6     555    3           Savannah          Madina Diasra

[Figure: Locus 119]
Implications for recent selection in the Anopheles gambiae genome
1998 data set:
– Strongest evidence for selection is for:
  • locus 637 (chromosome 2) in Bamako form
  • locus 038 (X chromosome) in Savannah form
– Most selected loci are on chromosome 2
– For a given chromosomal form collected at Banambani and Selinkenyi, selection seems to be evident in both locations
– The same does not apply for a given location where multiple chromosomal forms are collected
– Suggests there is more gene flow between these two villages than there is between chromosomal forms
2003 data set:
– Strongest evidence for selection is for:
  • locus 119 (chromosome 3R) in Mopti form in Oure
  • locus 127 (chromosome 3R) in Savannah form in Oure
– Selected loci are dispersed throughout chromosome 3 (only chromosome 3 loci were analyzed in this data set)
– This time there is very little correlation for given chromosomal forms collected at neighbouring locations
– Possibly selection on chromosome 3 is weaker (1998 data set showed no selection on chromosome 3)