michael-blum
When Bayes meets Darwin: a journey in population genomics
Laboratoire TIMC-IMAG, Grenoble
In “The Descent of Man”, Darwin concluded that the visual differences between human populations were not adaptive to any significant degree […]
“Natural selection has almost become irrelevant in human evolution. There's been no biological change in humans in 40,000 or 50,000 years.” Stephen J. Gould
But here is a counter-example
• Tibetan populations have adapted to their high-altitude, low-oxygen environment through an increased respiratory rate and increased blood flow.
• These traits are transmitted from generation to generation.
• The Tibetan plateau has been inhabited for about 20,000 years.
Local adaptation
• Human adaptation to high altitude is an instance of local adaptation.
• Understanding how individuals adapt to their local environment is central in biology. Plants adapt to their environment, bacteria adapt to antibiotics…
• Definition of local adaptation: greater fitness (a measure of reproductive success) of individuals in their local habitats due to natural selection.
How to find genomic regions involved in local adaptation?
Data description
Single Nucleotide Polymorphism (SNP)
Indiv 1            ....ACCCG....   ....AACCG....
Number of copies   1   0
Indiv 2            ....ACCCT....   ....ACCCT....
Number of copies   0   2
• 3 billion base pairs in the human genome
• Commercial SNP chips: 100€ for 500,000 SNPs
• dbSNP: > 10^6 SNPs
Single Nucleotide Polymorphism (SNP)
Data matrix Y
          Locus 1   Locus 2   Locus 3
Indiv 1      1         0         2
Indiv 2      0         2         0
Indiv 3      0         0         0
Indiv 4      0         1         1
Indiv 5      1         1         1
Main principle of population genomics
• Genome-wide patterns are influenced by neutral processes: migration, admixture, expansion.
• Genes involved in local adaptation are outliers.
Adaptation to altitude: Manhattan plot
Xu et al. MBE 2011
Human HGDP data
Genome-wide patterns
Principal component analysis
[Figure: scatter plot with axes PC1 and PC2; populations: Africa, America, Oceania, Middle-East, Europe, East Asia, Asia]
Principal component analysis
Novembre et al. Nature 2008
Genome scan for local adaptation: a Bayesian PCA approach
Singular Value Decomposition (SVD)
viewpoint of PCA
In matrix notation, we have
Y = UV,
where Y is the (n,p) genotype matrix, U is the (n,K) score matrix and V is the (K,p) loadings matrix.
Variations around SVD in machine learning: matrix factorization, low-rank approximation, probabilistic PCA, factor analysis, …
Singular Value Decomposition (SVD)
viewpoint of PCA
An optimal approximation of rank K for the matrix of genotypes Y
Y_i = \sum_{k=1}^{K} u_{ik} V_k
Y_i: genotype of the i-th individual (0,1,1,2,0,0,…)
V_k: vector of loadings of the same length as Y_i, V_k = (v_{k,1}, v_{k,2}, v_{k,3}, ...)
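The rank-K approximation can be illustrated with a truncated SVD. The following is a minimal sketch, not the talk's implementation: it uses NumPy, the toy 5 x 3 genotype matrix from the data-matrix slide, and a hypothetical helper name `rank_k_approximation`. It also skips the column centering that PCA normally applies, so that the product stays in the Y = UV form used here.

```python
import numpy as np

# Toy genotype matrix Y (5 individuals x 3 loci), copied from the data-matrix slide.
Y = np.array([[1, 0, 2],
              [0, 2, 0],
              [0, 0, 0],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)

def rank_k_approximation(Y, K):
    """Hypothetical helper: scores U (n, K), loadings V (K, p) and the product U @ V."""
    # Thin SVD: Y = A @ diag(s) @ Bt, singular values in decreasing order.
    A, s, Bt = np.linalg.svd(Y, full_matrices=False)
    U = A[:, :K] * s[:K]   # scores: left singular vectors scaled by singular values
    V = Bt[:K, :]          # loadings: first K right singular vectors
    return U, V, U @ V

errors = []
for K in (1, 2, 3):
    U, V, Y_hat = rank_k_approximation(Y, K)
    mse = np.mean((Y - Y_hat) ** 2)
    errors.append(mse)
    print(f"K={K}: U {U.shape}, V {V.shape}, mean squared error {mse:.4f}")
```

Each row of `Y_hat` is the sum over k of `U[i, k] * V[k]`, which is the per-individual formula on this slide. The error cannot increase as K grows.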
Bayesian principal component analysis
• A probabilistic version of PCA (Tipping and Bishop 1999)
Y_i = \sum_{k=1}^{K} u_{ik} V_k + \varepsilon_i
• The variance-inflation model for outlier detection (Box and Tiao 1968)
p(v_j) = (1 - \pi) N(0, \sigma^2) + \pi N(0, c^2 \sigma^2),
where \pi is the genome-wide outlier probability, and the prior for c^2 is Uniform(1, c^2_{max}).
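To see how the variance-inflation prior separates outliers from the bulk, here is a small sketch. It makes several simplifying assumptions: it treats a single scalar loading v, holds \pi, \sigma^2 and c^2 fixed at illustrative values (in the hierarchical model they are unknown), and uses hypothetical function names. It applies Bayes' rule to the two mixture components.

```python
import numpy as np

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * x ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def outlier_posterior(v, pi, sigma2, c2):
    """P(outlier | v) under p(v) = (1 - pi) N(0, sigma2) + pi N(0, c2 * sigma2)."""
    w_out = pi * normal_pdf(v, c2 * sigma2)      # inflated-variance component
    w_in = (1.0 - pi) * normal_pdf(v, sigma2)    # regular component
    return w_out / (w_out + w_in)

# Illustrative values only; they are not taken from the talk.
pi, sigma2, c2 = 0.05, 1.0, 25.0
v = np.array([0.0, 1.0, 3.0, 6.0])
post = outlier_posterior(v, pi, sigma2, c2)
for vj, pj in zip(v, post):
    print(f"loading {vj:4.1f} -> P(outlier | v) = {pj:.4f}")
```

Because c^2 > 1, the outlier probability rises with the magnitude of the loading: small loadings fall below the prior \pi and large ones approach 1.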
Accounting for local correlation in the genome
Local correlation because of recombination
Ising model (outlier Z_j = 1, non-outlier Z_j = 0)
P(Z_j = 1) \propto \pi \exp(\beta \sum_{k \sim j} Z_k),
where \beta > 0 is a hyperparameter and k \sim j denotes the loci k neighbouring locus j.
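The effect of the Ising prior can be sketched as a conditional probability for one locus given its neighbours. The slide gives only the weight of the outlier state, so two things below are assumptions rather than the talk's specification: the neighbours of locus j are taken to be j-1 and j+1, and the non-outlier state is given the weight 1 - \pi. The function name is hypothetical.

```python
import numpy as np

def ising_conditional(Z, j, pi, beta):
    """P(Z_j = 1 | neighbours); neighbours assumed to be loci j-1 and j+1."""
    neighbours = [k for k in (j - 1, j + 1) if 0 <= k < len(Z)]
    s = sum(Z[k] for k in neighbours)   # number of neighbouring outliers
    w1 = pi * np.exp(beta * s)          # outlier weight, as on the slide
    w0 = 1.0 - pi                       # ASSUMPTION: non-outlier weight, not given in the talk
    return w1 / (w1 + w0)

# Illustrative outlier indicators along a chromosome and illustrative hyperparameters.
Z = np.array([0, 1, 0, 1, 0, 0, 0])
pi, beta = 0.05, 2.0
for j in (2, 4, 5):
    print(f"locus {j}: P(Z_j = 1 | neighbours) = {ising_conditional(Z, j, pi, beta):.3f}")
```

Under these assumptions a locus with no outlier neighbour keeps the prior probability \pi, and each neighbouring outlier multiplies its odds of being an outlier by exp(\beta).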
A hierarchical Bayesian model
Gibbs sampler for sampling the posterior
[Figure: graphical model with nodes Y, U, V, σ, c, c_max, β, π, Z, K, σ0]
Low-rank approximation for outlier detection in video sequences
Bayesian scores for detecting outliers
• Bayes factors: a Bayesian alternative to P-values
BF = P(Y_j | outlier) / P(Y_j | non-outlier)
• Posterior odds
P(outlier | Y_j) / P(non-outlier | Y_j) = prior odds × BF
• For any list of outlier SNPs, a false discovery rate can be estimated based on posterior odds.
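The chain from Bayes factors to posterior odds to a false discovery rate can be sketched as follows. The talk does not say which FDR estimator is used; the one below (the average posterior probability of being a non-outlier over the selected SNPs) is one common Bayesian choice and is an assumption here, as are the function names and the made-up log10(BF) values.

```python
import numpy as np

def posterior_outlier_prob(log10_bf, pi):
    """P(outlier | Y_j) from log10 Bayes factors and prior outlier probability pi."""
    prior_odds = pi / (1.0 - pi)
    post_odds = prior_odds * 10.0 ** np.asarray(log10_bf, dtype=float)
    return post_odds / (1.0 + post_odds)

def estimated_fdr(post_prob, selected):
    """ASSUMED estimator: mean posterior probability of being a non-outlier in the list."""
    return float(np.mean(1.0 - post_prob[selected]))

# Made-up values for six SNPs; pi is illustrative.
log10_bf = np.array([4.0, 2.5, 1.0, 0.2, -0.5, -1.0])
pi = 0.04
post = posterior_outlier_prob(log10_bf, pi)

ranking = np.argsort(-post)            # SNPs from most to least likely outlier
for size in (1, 2, 3):
    top = ranking[:size]
    print(f"top {size} SNPs: estimated FDR = {estimated_fdr(post, top):.3f}")
```

A Bayes factor of 1 leaves the prior probability unchanged, and longer lists admit SNPs with weaker evidence, so the estimated FDR grows with list size.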
Ex 1: a simulation study in a divergence model
Neutral divergence (ms)
Divergence with selection (simuPOP): 4% of the 10,000 SNPs are under selection
Other methods for genome scans of local adaptation
• Fst: a measure of differentiation between populations
• BayeScan (Foll and Gaggiotti 2008)
• Both methods assume (implicitly or explicitly) a mechanistic model of instantaneous divergence
Population structure
[Figure: PCA scatter plot with axes PC1 and PC2; legend: Neutral, Adaptive]
Selection scan
[Figure: log10(BF) (0 to 8) against SNP index (0 to 10,000); legend: PC 1, PC 2, PC 3]
Comparing methods of selection scan
[Figure: false discovery rate (0 to 0.6) against divergence time (0.01 to 0.05) for BayeScan, PCAdapt and Fst]
Advantage of non-parametric methods in data-rich situations
Ex 2: a spatially explicit simulation with a gradient of selection
Population structure
[Figure: three panels, PC 1, PC 2 and PC 3]
Selection scan
[Figure: log10(BF) (0 to 250) against SNP index (0 to 2,000); legend: PC 1, PC 2, PC 3]
Application to the human HGDP data
[Figure: scatter plot with axes PC1 and PC2; populations: Africa, Americas, Oceania, Middle-East, Europe, East Asia, Asia]
Manhattan plot
[Figure: log10(BF) against physical position (0e+00 to 8e+07); legend: PC1, PC2, PC3, PC4; labelled peak: ABCC11]
Top hit is on chromosome 16
Geographic distribution of the top SNP
Involved in earwax type (cerumen) and perspiration
Enrichment analysis
[Figure: scatter plot with axes PC1 and PC2; populations: Africa, Americas, Oceania, Middle-East, Europe, East Asia, Asia]
Are PC2 outliers enriched for genes involved in immunity?
Big data
What can you do with millions of SNPs? Scalable Bayesian computation?
Standard PCA and permutation tests.
A George Box (1919-2013) story to conclude
• Box wanted to write a paper with Cox because having a Box and Cox paper would be fun.
• They decided to write a paper on transformations.
• One author wrote the Bayesian version and the other wrote the maximum likelihood version. We do not know who wrote what.
• In the end, it did not make much practical difference.
Nicolas Duforet-Frebourg
Spatial autocorrelation explains the PCA pattern
Choice of K
[Figure: mean squared error (0.160 to 0.185) against K (2 to 12)]
Robustness w.r.t. the choice of K
[Figure: false discovery rate (0 to 1) against divergence time (0.01 to 0.05) for K=1, K=2 and K>2]