Introduction to Haplotype Estimation
Stat/Biostat 550
The Haplotype Problem
• Suppose we genotype individuals at a number of tightly linked SNPs.
A C G C C T T T G C G C
G A A C C C C C A G G C
The Haplotype Problem
• What do the types on the two chromosomes look like?
Haplotypes: who cares?
• LD mapping: increase power?
• LD mapping: decrease genotyping?
• Evolutionary studies: selection, recombination, gene conversion, population structure,…
Many people, for many different reasons…
The Haplotype Problem – potential solutions
• Molecular methods
• Collect family data
• Statistical methods for population data
The Simplest Case
• What do the types on the two chromosomes look like?
The Next Simplest Case
• What do the types on the two chromosomes look like?
The first difficult case…
• What do the types on the two chromosomes look like?
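To make the difficulty concrete, the sketch below enumerates every haplotype pair consistent with one unphased genotype. It assumes biallelic SNPs with each site coded as 0, 1, or 2 copies of one allele (so 1 means heterozygous); the coding and the function name `haplotype_pairs` are illustrative choices, not taken from the slides. A genotype with k heterozygous sites has a single consistent pair when k is 0 or 1, and 2^(k-1) pairs otherwise.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Return all unordered haplotype pairs that add up to `genotype`.

    `genotype` is a tuple with one entry per SNP: 0 or 2 for the two
    homozygous states, 1 for a heterozygous site (assumed coding).
    """
    het_sites = [j for j, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for j, b in zip(het_sites, bits):
            h1[j], h2[j] = b, 1 - b
        # Sort the two haplotypes so (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    for g in [(0, 2, 0), (0, 1, 2), (1, 1, 0), (1, 1, 1)]:
        print(g, len(haplotype_pairs(g)))
```

With no or one heterozygous site the phase is determined by the genotype alone; ambiguity starts at two heterozygous sites.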
Clark’s Method (1990)
• Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.
Is it this configuration?
…or this one?
This one is more probable.
Clark’s Method (Clark, 1990)
• Identify the unambiguous individuals.
• Make a list of “known” haplotypes.
• Go through list, and see whether ambiguous individuals can be made up from a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
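The three steps above can be sketched roughly as follows. This is a minimal illustration, not Clark's original implementation: it assumes genotypes coded per site as 0, 1, or 2 copies of one allele, haplotypes as 0/1 tuples, and the names `complement` and `clark_phase` are hypothetical.

```python
def complement(genotype, hap):
    """Return the haplotype that, added to `hap`, gives `genotype`, or None."""
    comp = tuple(g - h for g, h in zip(genotype, hap))
    return comp if all(c in (0, 1) for c in comp) else None


def clark_phase(genotypes):
    """Resolve genotypes in list order using exact matches to known haplotypes."""
    known = []      # list of "known" haplotypes, in order of discovery
    resolved = {}   # individual index -> (haplotype, haplotype)

    # Step 1 and 2: unambiguous individuals have at most one heterozygous site.
    for i, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            first = tuple(x // 2 for x in g)
            second = complement(g, first)
            resolved[i] = (first, second)
            for h in (first, second):
                if h not in known:
                    known.append(h)

    # Step 3: repeatedly try to explain ambiguous individuals as
    # "known haplotype + complementary haplotype".
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                comp = complement(g, h)
                if comp is not None:
                    resolved[i] = (h, comp)
                    if comp not in known:
                        known.append(comp)
                    progress = True
                    break
    return resolved, known
```

Because the first compatible known haplotype wins, the output of this sketch depends on the order of both the individuals and the list, and an individual with no exact match stays unresolved.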
Clark’s Method
[Figure: individuals 1, 2, 3 and the list of known haplotypes]
Clark’s Method: Problem 1
[Figure: individuals 1, 2, 3 and the list of known haplotypes]
The answer depends on the order in which the list is considered…
… and frequency information is ignored.
Clark’s Method: Problem 2
[Figure: individuals 1, 2, 3 and the list of known haplotypes]
The algorithm can fail to resolve all haplotypes…
… because it looks only for exact matches.
Clark’s Algorithm: Summary
• Results may depend on the order in which individuals are considered.
• Frequency information is ignored.
• May fail to resolve all haplotypes.
• Fails to assess uncertainty.
• Looks only for exact matches.
• Fast and intuitive(?).
Maximum Likelihood (EM Algorithm)
• Idea: find haplotype frequencies (f1,…,fN) to maximise probability of observed genotype data (g1,…,gn).
$$\Pr(g_i \mid f_1,\ldots,f_N) = \sum_{\{h_1,h_2 \,:\, h_1 \oplus h_2 = g_i\}} f_{h_1} f_{h_2}$$
$$\Pr(g_1,\ldots,g_n \mid f_1,\ldots,f_N) = \prod_i \Pr(g_i \mid f_1,\ldots,f_N)$$
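A minimal EM sketch for this likelihood, under stated assumptions: genotypes are coded per site as 0, 1, or 2 copies of one allele, each genotype probability is the sum of frequency products over compatible ordered haplotype pairs, and the iteration starts from uniform frequencies. The function names and the fixed iteration count are illustrative choices, not part of the slides.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All ordered haplotype pairs (h1, h2) whose site-wise sum is `genotype`."""
    options = []
    for g in genotype:
        if g == 0:
            options.append([(0, 0)])
        elif g == 2:
            options.append([(1, 1)])
        else:  # heterozygous: either phase is possible
            options.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*options):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate haplotype frequencies by EM from unphased genotypes."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: posterior weight of each compatible pair, given current freq.
        for pairs in pair_lists:
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq
```

The number of candidate haplotypes grows exponentially with the number of heterozygous sites, so a sketch like this is only practical for a handful of SNPs.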
Bayesian version
Modify Clark’s algorithm:
• Replace the single pass through the data with an iterative scheme.
• Allow for uncertainty in resolution.
• Use frequency information.
Resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).
Example
• One haplotype matches 1 known haplotype; the other does not match any. Assigned moderate probability.
• One haplotype matches 3 known haplotypes; the other does not match any. Assigned higher probability.
• Neither haplotype matches any known haplotype. Assigned low probability.
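One way to turn these match counts into probabilities, sketched under assumptions: weight each candidate haplotype pair by the product of (number of matching known haplotypes + a small pseudo-count), then normalise. The pseudo-count value and the function name `resolution_probabilities` are hypothetical and are not taken from the slides or from any published implementation.

```python
def resolution_probabilities(candidate_pairs, known_counts, pseudo=0.1):
    """Probability of each candidate (h1, h2) pair for one individual.

    `known_counts` maps a haplotype to how many times it appears among
    the other individuals' current haplotypes. `pseudo` keeps unseen
    haplotypes possible (illustrative value).
    """
    weights = [
        (known_counts.get(h1, 0) + pseudo) * (known_counts.get(h2, 0) + pseudo)
        for h1, h2 in candidate_pairs
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    counts = {"A": 1, "B": 3}
    # Pair 0: matches 1 known + no match; pair 1: matches 3 known + no match;
    # pair 2: neither haplotype matches.
    pairs = [("A", "x"), ("B", "y"), ("z", "w")]
    print(resolution_probabilities(pairs, counts))
```

In an iterative scheme, one individual would be resampled from these probabilities at a time, with the counts recomputed from everyone else's current haplotypes.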
Problems with EM/naïve Gibbs
• Potentially (very) large number of parameters to estimate, leading to inaccurate estimates.
• Can be time-consuming for large problems.
• Can “converge” to poor local optima (alleviated by multiple runs).
Further modification
• Take into account “near misses”, as well as exact matches.
(PHASE v1.0: Stephens, Smith and Donnelly 2001)
Example
• One haplotype matches 1 known haplotype; the other differs by 2 from 3 known haplotypes.
• One haplotype matches 3 known haplotypes; the other differs by 2 from 1 known haplotype.
• One haplotype differs by 1 from 3 known haplotypes; the other differs by 1 from 1 known haplotype.
How to balance these possibilities?
The key question
• What is the conditional distribution of the next haplotype, given a set of known haplotypes?
Example
1
2
Given the above haplotypes, what would you expect the next haplotype to look like?
Qualitative answer
• The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype.
• Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
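The qualitative answer can be illustrated with a simplified sketch. The assumptions here are mine: haplotypes are 0/1 tuples, the next haplotype copies a uniformly chosen existing one, the number of mutations is geometric with parameter theta/(r + theta) for r known haplotypes, and each mutation flips one uniformly chosen site. `theta` is a hypothetical scaled mutation rate, and the function name is illustrative.

```python
def next_haplotype_distribution(known, theta=1.0, max_mutations=50):
    """Approximate Pr(next haplotype | known haplotypes) as a dict.

    `known` is a list of 0/1 tuples, repeated according to how often
    each haplotype has been seen. The sum over mutation counts is
    truncated at `max_mutations`.
    """
    r = len(known)
    n_sites = len(known[0])
    p_more = theta / (r + theta)  # chance of one further mutation

    # Zero mutations so far: distribution of the copied haplotype.
    current = {}
    for h in known:
        current[h] = current.get(h, 0.0) + 1.0 / r

    result = {}
    weight = 1.0 - p_more  # probability of exactly 0 mutations
    for _ in range(max_mutations + 1):
        for h, p in current.items():
            result[h] = result.get(h, 0.0) + weight * p
        # Apply one more mutation: flip a uniformly chosen site.
        mutated = {}
        for h, p in current.items():
            for j in range(n_sites):
                m = h[:j] + (1 - h[j],) + h[j + 1:]
                mutated[m] = mutated.get(m, 0.0) + p / n_sites
        current = mutated
        weight *= p_more
    return result
```

Under these assumptions an exact copy of a common haplotype is most probable, a copy of a rare haplotype comes next, and near misses receive a small but non-zero probability that shrinks with each extra mutation.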
Comparisons on simulated data
Problems
• Time-consuming for large problems.
• Can “converge” to poor local optima.
• Ignores recombination (decay of LD with distance).
• How should uncertainty in haplotype estimates be treated?
… to be continued.