RECOMBINOMICS: Myth or Reality?
Laxmi Parida
IBM Watson Research New York, USA
IBM Computational Biology Center
RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion
www.nationalgeographic.com/genographic
www.ibm.com/genographic
A five-year study, launched in April 2005, to address anthropological questions on a global scale using genetics as a tool.
Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth. How did we, each of us, end up where we are?
Samples are being collected all around the world, and the mtDNA and Y chromosome are being sequenced and analyzed.
phylogeographic question
DNA material in use under unilinear transmission
58 million bp (Y chromosome)
0.38%
16,000 bp (mtDNA)
Missing information in unilinear transmissions
past
present
Table Mountain, Cape Town, South Africa
Paradigm Shift in Locus & Analysis
Using recombining DNA sequences. Why?
Nonrecombining DNA gives a partial story:
1. represents only a small part of the genome
2. behaves as a single locus
3. unilinear (exclusively male or female) transmission
Recombining DNA: towards more complete information
Challenges:
Computationally very complex
How to comprehend complex reticulations?
RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion
L Parida, Pedigree History: A Reconstructability Perspective using Random-Graphs Framework,
Under preparation.
The Random Graphs Framework
GRAPH DEF:
1. Infinite number of vertices arranged in finite-sized rows
2. Edges introduced via a random process across immediate rows
PROPERTIES: Address some topological questions
1. First, identify a Probability Space
2. Then, pose and address specific questions (such as expected depth of LCA, etc.)
The Random Graphs Framework
1. Infinite number of vertices with a specific organization
2. Edges introduced via a random process satisfying specific rules
3. Address some topological questions
   1. Define a Probability Space
   2. Pose and answer specific questions (such as expected depth of LCA, etc.)
Wright-Fisher Model
1. Constant population
2. Non-overlapping generations
3. Panmictic
Properties of this Pedigree Graph
1. DAG (Directed Acyclic Graph)
2. |E| = O(|V|) for any finite fragment (sparse graph); vertex-centric view
3. Focus on the flow of genetic material: relevant pedigree graph
Pedigree Graph: GPG(K,N)
K: number of extant units; 2N: population size per generation
Can the model ignore color of vertex?
Forbidden Structure
Probability Space
Space is non-enumerable
Uniform probability measure? (WF population)
Compute the probability of some event F(h) for a fixed depth h, then take the limit
Topological Property of GPG(K,N)
Least Common Ancestor (LCA) of all K extant vertices (TMRCA or GMRCA)
How many LCAs?
Expected Depth of the shallowest LCA
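The expected depth can be explored empirically. The following is a minimal Monte Carlo sketch, not the analysis from the talk. It assumes each row holds N "red" and N "blue" vertices and that every vertex draws one parent of each color uniformly at random from the row above; the function name `shallowest_lca_depth` and the `max_depth` cutoff are illustrative choices. Only vertices with extant descendants are tracked, in the spirit of the relevant pedigree graph.

```python
import random

def shallowest_lca_depth(K, N, rng, max_depth=10_000):
    """Depth of the first row holding a common ancestor of all K extant vertices."""
    if K == 1:
        return 0  # a single extant vertex is its own common ancestor
    size = 2 * N  # assumed layout: 0..N-1 are red, N..2N-1 are blue
    extant = rng.sample(range(size), K)
    # reach[v] = indices of the extant vertices that descend from row vertex v
    reach = {v: {i} for i, v in enumerate(extant)}
    for depth in range(1, max_depth + 1):
        parents = {}
        for descendants in reach.values():
            red = rng.randrange(N)        # one red parent
            blue = N + rng.randrange(N)   # one blue parent
            for p in (red, blue):
                parents.setdefault(p, set()).update(descendants)
        reach = parents
        if any(len(d) == K for d in reach.values()):
            return depth
    raise RuntimeError("no common ancestor found within max_depth")

rng = random.Random(0)
depths = [shallowest_lca_depth(4, 8, rng) for _ in range(200)]
print(sum(depths) / len(depths))  # empirical mean depth for GPG(4,8)
```

Averaging over many sampled instances gives an estimate only; it says nothing about how many LCAs a given instance has.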
Infinite number of LCAs in a GPG(4,3) instance
In fact, there exist infinitely many such instances!
Topological Property of GPG(K,N)
Least Common Ancestor (LCA) (TMRCA or GMRCA)
How many LCAs?
Expected Depth of the shallowest “LCA”
MEASURE OF RECONSTRUCTABILITY
(Genetic Exchange) Sexual Reproduction vs Graph Model
Ancestor without ancestry
Graph Theory vis-à-vis Population Genetics
1. Graph Theoretic (topological):
CA: common ancestor
LCA: Least CA, or Shallowest CA
MRCA: Most Recent CA
TMRCA: The MRCA
2. Graph Theoretic + Biology (Genetic Exchange):
CAA: common ancestor-&-ancestry
LCAA: Least CAA
GMRCA: Grand MRCA
Unilinear Transmission
Different Models as Subgraphs
Pedigree Graph GPG(K,N): each vertex has 2 parents
1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N): each vertex has 1 parent (mtDNA Tree, NRY Tree)
2. Mixed Subgraph GPGE(K,N,M): each vertex has 1 OR 2 parents (Genetic Exchange Model, ARG)
Number of vertices per row is no more than KM
M is the number of completely linked segments in each extant unit
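For the one-parent subgraphs, the depth of the single common ancestor can be simulated directly. This is a minimal sketch under the Wright-Fisher assumptions listed earlier, assuming N same-colored vertices per row and a parent drawn uniformly at random; `tmrca_depth` is an illustrative name, not one used in the talk.

```python
import random

def tmrca_depth(K, N, rng):
    """Rows traversed until K one-parent lineages merge into a single vertex."""
    if not 1 <= K <= N:
        raise ValueError("need 1 <= K <= N")
    lineages = set(rng.sample(range(N), K))
    depth = 0
    while len(lineages) > 1:
        # every surviving lineage picks one parent in the row above;
        # lineages that pick the same parent coalesce
        lineages = {rng.randrange(N) for _ in lineages}
        depth += 1
    return depth

rng = random.Random(0)
samples = [tmrca_depth(4, 8, rng) for _ in range(200)]
print(sum(samples) / len(samples))  # empirical mean depth for a GPTY(4,8)-like tree
```

Because each vertex has exactly one parent, the lineages always end in a single root, in contrast with the two-parent pedigree graph.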
Different Models
GPG(4,8) GPTY(4,8) GPGE(4,8,2)
Different Models as Subgraphs
LCA GMRCA
LCA TMRCA
LCA GMRCA
Pedigree Graph GPG(K,N)
1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)
2. Mixed Subgraph GPGE(K,N,M)
GPGE(K,N,M) ARG
Ancestral Recombinations Graph (Griffiths & Marjoram ’97)
Embellish GPGE(K,N,M) with Genetic Exchanges (GE)
Each extant unit has M segments
No vertex with zero ancestral segments (to extant units)
Mixed Subgraph GPGE(K,N,M)
1. Plausible GE assignment?
2. Can GPGE(K,N,M) go colorless?
Yes, through algorithmic subsampling
Algorithm: Embellish GPGE(K,N,M)
1. Assign a sequence s to an instance, e.g. s = K, (2K), (2K-7), (2K-15), ...
2. Construct M sequences s_i
Each s_i is monotonically decreasing; s_i[j] is no bigger than s[j]
3. Associate each s_i with a segment, and each element s_i[j] = k with k randomly selected vertices at depth j
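Steps 2 and 3 can be sketched as follows. This is one possible reading, not the algorithm from the talk: it assumes s[j] is the number of vertices at depth j, that every s_i starts at s[0] = K, and that "monotonically decreasing" allows equal neighbors. The names `segment_sequences` and `assign_vertices` are hypothetical, and the constraint that no vertex ends up with zero ancestral segments is not enforced here.

```python
import random

def segment_sequences(s, M, rng):
    """Draw M non-increasing sequences s_i with s_i[0] = s[0] and s_i[j] <= s[j]."""
    seqs = []
    for _ in range(M):
        seq = [s[0]]
        for j in range(1, len(s)):
            upper = min(seq[-1], s[j])  # stay below both the previous value and s[j]
            seq.append(rng.randint(1, upper))
        seqs.append(seq)
    return seqs

def assign_vertices(seq, s, rng):
    """For each depth j, pick seq[j] of the s[j] vertices to carry the segment."""
    return [sorted(rng.sample(range(s[j]), k)) for j, k in enumerate(seq)]

rng = random.Random(0)
s = [8, 16, 9, 1]  # K, 2K, 2K-7, 2K-15 with K = 8
seqs = segment_sequences(s, 3, rng)
carriers = assign_vertices(seqs[0], s, rng)
print(seqs)
print(carriers)
```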
Algorithm: Constructing seqs…
“Topological” Definition of LCAA in GPGE(K,N,M)
Input: GPGE(K,N,M) with GE embellishment
LCAA:
1. CA in all M subgraphs (trees)
2. Least such CA
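The two-part definition translates into a short computation. This sketch assumes each of the M segment trees is given as a child-to-parent dictionary over shared vertex names, and that a `depth` map records the row of every vertex; both representations and the function names are illustrative assumptions.

```python
def ancestors(parent, v):
    """Vertices on the path from v up to the root of one tree, v included."""
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def common_ancestors(parent, extant):
    """Vertices that are ancestors of every extant vertex in one tree."""
    return set.intersection(*(set(ancestors(parent, v)) for v in extant))

def lcaa(trees, extant, depth):
    """Shallowest vertex that is a common ancestor in all M trees, or None."""
    shared = set.intersection(*(common_ancestors(t, extant) for t in trees))
    return min(shared, key=lambda v: depth[v]) if shared else None

# Two segment trees over extant units a and b.
tree1 = {"a": "c", "b": "c", "c": "e"}
tree2 = {"a": "c", "b": "d", "c": "e", "d": "e"}
depth = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
print(lcaa([tree1], ["a", "b"], depth))         # c: LCA within the first tree alone
print(lcaa([tree1, tree2], ["a", "b"], depth))  # e: least CA shared by both trees
```

The example shows why the LCAA can be deeper than the LCA of any single segment tree.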
Different Models as Subgraphs
LCAA GMRCA
LCA TMRCA
LCAA GMRCA
Pedigree Graph GPG(K,N)
1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)
2. Mixed Subgraph GPGE(K,N,M)
Probability of Instances with Unique LCA/LCAA
Pedigree Graph GPG(K,N)
1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)
2. Mixed Subgraph GPGE(K,N,M)
GMRCA LCAA LCA & lone pair
TMRCA LCA
GMRCA LCAA LCA & lone node
Pedigree Graph GPG(K,N)
1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)
2. Mixed Subgraph GPGE(K,N,M)
“Topological” Defns of LCAA
Expected Depth E(D) of LCA/LCAA
O(N2)
O(K)
O(KM)
Pedigree Graph GPG(K,N)
1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)
2. Mixed Subgraph GPGE(K,N,M)
RECONSTRUCTABILITY
O(N2)
O(K)
O(KM)
Pedigree Graph GPG(K,N)
1. Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N)
2. Mixed Subgraph GPGE(K,N,M)
Summary: History Reconstruction?
1. The Mixed Subgraph models recombinations
Only fragments of the chromosome
2. In reality, only a minimal structure (HUD) of the GPGE(K,N,M) or ARG can be estimated
Forbidden structures ...
RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008
L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009
INPUT: Chromosomes (haplotypes)
OUTPUT: Recombinational Landscape (Recotypes)
Our Approach
[Flowchart: Granularity g (statistical) -> IRiS (combinatorial) -> Acceptable p-value? (statistical); YES -> Analyze Results; NO -> return to Granularity g]
M Mele, A Javed, F Calafell, L Parida, J Bertranpetit and Genographic Consortium, Recombination-based genomics: a genetic variation analysis in human populations, under submission.
Preprocess: Dimension reduction via Clustering
[Figure: clustering diagram with nodes labeled 0 to 24]
p-value Estimation
Comparison of the Randomization Schemes
SNP Blocks (granularity g=3)
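As an illustration of what a granularity of g = 3 could mean, the sketch below groups consecutive SNP columns into non-overlapping blocks of g sites and replaces each haplotype's block content with a pattern id. Non-overlapping blocks, the 0/1 string encoding, and the name `block_patterns` are assumptions made for this example; the slide does not specify how IRiS forms its blocks.

```python
def block_patterns(haplotypes, g):
    """Encode each block of g consecutive SNPs as a per-haplotype pattern id."""
    n_sites = len(haplotypes[0])
    blocks = []
    for start in range(0, n_sites - g + 1, g):  # a trailing partial block is dropped
        ids = {}  # pattern string -> id, numbered in order of first appearance
        blocks.append([ids.setdefault(h[start:start + g], len(ids)) for h in haplotypes])
    return blocks

haplotypes = ["000111", "000101", "110111"]
print(block_patterns(haplotypes, 3))  # [[0, 0, 1], [0, 1, 0]]
```

Working on block patterns rather than single SNPs reduces the number of columns by a factor of g.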
IRiS (Identifying Recombinations in Sequences)
Stage Haplotypes: use SNP block patterns
Segment along the length: infer trees
Infer network (ARG)
biological insights
computational insights
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008
Segmentation
[Figure: columns along the sequence, each labeled with its segment number (1 to 5); the trailing columns are unassigned]
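One standard way to cut a set of haplotypes into segments that each admit a tree is the four-gamete test, sketched here as a stand-in technique; it is not claimed to be the segmentation IRiS performs. The sketch assumes biallelic SNPs encoded as 0/1 strings and greedily extends a segment until a new column shows all four gametes with some column already in the segment.

```python
def compatible(col_a, col_b):
    """Four-gamete test: two biallelic columns fit one tree if < 4 gametes occur."""
    return len(set(zip(col_a, col_b))) < 4

def segment(haplotypes):
    """Greedy left-to-right split into half-open (start, end) column ranges."""
    n_sites = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_sites)]
    segments, start = [], 0
    for j in range(1, n_sites):
        # close the current segment as soon as column j conflicts with it
        if not all(compatible(columns[i], columns[j]) for i in range(start, j)):
            segments.append((start, j))
            start = j
    segments.append((start, n_sites))
    return segments

print(segment(["00", "01", "10", "11"]))  # [(0, 1), (1, 2)]
print(segment(["000", "011", "110"]))     # [(0, 3)]
```

Each resulting segment is free of pairwise conflicts, so a tree can be inferred for it before the trees are combined into a network.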
Consensus of Trees
Algorithm Design
1. Ensure compatibility of component trees
2. Parsimony model: minimize the number of recombinations
Theorem: The problem is NP-Hard.
In practice: no polynomial-time algorithm that guarantees optimality is known, and none exists unless P = NP.
DSR Scheme (Dominant-Subdominant-Recombinant)
DSR Scheme: Level 1
DSR Assignment Rules
1. At most one D per row and column; if no D, at most one S per row and column
2. At most one non-R in the row and column, but not both
DSR Assignment Rules
1. Each row and each column has at most one D; else (no D) it has at most one S
2. A non-R can have other non-Rs either in its row or in its column, but NOT both
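The two rules can be checked mechanically. The sketch below encodes one reading of them, assuming the assignment is a rectangular matrix whose cells are the letters D, S or R; the slide does not say what the rows and columns stand for, and `check_dsr` is a hypothetical helper, not part of IRiS.

```python
def check_dsr(matrix):
    """Return True if a D/S/R matrix satisfies both assignment rules as read here."""
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*rows)]
    # Rule 1: at most one D per line; a line without a D has at most one S.
    for line in rows + cols:
        if line.count("D") > 1:
            return False
        if line.count("D") == 0 and line.count("S") > 1:
            return False
    # Rule 2: a non-R may share its row or its column with other non-Rs, not both.
    for row in rows:
        for j, cell in enumerate(row):
            if cell == "R":
                continue
            row_others = sum(c != "R" for c in row) - 1
            col_others = sum(c != "R" for c in cols[j]) - 1
            if row_others > 0 and col_others > 0:
                return False
    return True

print(check_dsr(["DR", "RD"]))  # True
print(check_dsr(["DS", "SR"]))  # False: the D has non-Rs in its row and its column
```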
DSR Scheme: Level 2
DSR Scheme: Level 3
DSR Scheme: Level 4
DSR Scheme: Level 5
Mathematical Analysis: Approximation Factor
Greedy DSR Scheme
Z and Y are computable functions of the input
L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009
IRiS Output: RECOTYPE
Recombination vectors
    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 ...
s1   1  0  0  0  1  1  1  1  0   0   0   0   1   0 ...
s2   0  1  0  1  1  1  0  1  0   0   1   0   0   0 ...
...
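Recotypes are binary vectors, so samples can be compared by counting the recombination events on which they differ. The sketch below uses the first 14 entries of s1 and s2 shown above; choosing the Hamming distance is an assumption made for illustration, not necessarily the measure used in the talk.

```python
def hamming(a, b):
    """Number of recombination events on which two recotypes differ."""
    if len(a) != len(b):
        raise ValueError("recotypes must have equal length")
    return sum(x != y for x, y in zip(a, b))

# First 14 entries (R1..R14) of the two recotypes shown on the slide.
s1 = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
s2 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(hamming(s1, s2))  # 6
```

A matrix of such pairwise distances is the usual input to a clustering or network-building step.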
Quick Sanity Check: Ultrametric Network on RECOTYPES
IRiS software will be released by the end of summer ’09
Asif Javed
What’s in a name?
1. Allele-frequency variations between populations are also reflected in the purely recombination-based variations
2. Detects subcontinental divide from short segments based on population-level analysis
3. Detects populations from short segments based on recombination-event analysis
RECOMBIN-OMICS (Jaume Bertranpetit)
RECOMBIN-OMETRICS (Robert Elston)
Are we ready for the OMICS / OMETRICS?
o population-specific signals?
o other critical signals?
o anything we didn’t already know?
Thank you!!