Genome Comparison via de Bruijn Graphs

Student: Ilya Minkin
Advisor: Son Pham

St. Petersburg Academic University

June 4, 2012


Synteny Blocks: Algorithmic challenge

• Suppose that we are given two genomes
• The question is: how are they evolutionarily related to each other?
• To perform rearrangement analysis, we must decompose the genomes into synteny blocks
• Synteny blocks are evolutionarily conserved segments of the genome
• These blocks cover most of the genome
• They occur in both genomes with possible variations

Academic Project

Project: Identify synteny blocks for duplicated genomes represented as sequences of nucleotides.

• None of the previous synteny block reconstruction software (DRIMM-Synteny [Pham and Pevzner, 2010] included) can efficiently solve this problem.
• DRIMM-Synteny can find the synteny blocks for complicated genomes. But:
  • It requires the genome to be represented as a sequence of genes.

General Idea: de Bruijn Graph

• We are given an alphabet $\Sigma$ with $|\Sigma| = m$ and a string $S$ over it
• A substring $T$ of $S$ with $|T| = k$ is called a k-mer
• The de Bruijn graph is a multigraph $G_k = (V, E)$, where $V = \Sigma^{k-1}$ = {all possible (k − 1)-mers}
• If a k-mer $T$ occurs in $S$, we add a directed edge $(T[1, k-1], T[2, k])$ to the graph
• Build the de Bruijn graph from the nucleotide sequence
• Conserved regions will yield non-branching paths
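
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `de_bruijn_edges` is hypothetical, and the multigraph is stored as a counter of (source, target) pairs so that repeated k-mers show up as parallel edges. Only (k − 1)-mers that actually occur in the string are materialized, rather than all of $\Sigma^{k-1}$.

```python
from collections import Counter
from typing import Counter as CounterType, Tuple


def de_bruijn_edges(s: str, k: int) -> CounterType[Tuple[str, str]]:
    """Return the edge multiset of the de Bruijn graph of string s.

    Each k-mer T at position i contributes one directed edge from its
    prefix T[1, k-1] to its suffix T[2, k] (1-based, as on the slide).
    The counter value is the multiplicity of that edge.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    edges: CounterType[Tuple[str, str]] = Counter()
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges
```

For example, `de_bruijn_edges("ACGACG", 3)` contains the edge `("AC", "CG")` with multiplicity 2, because the 3-mer ACG occurs twice.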

Challenges

• Variations in synteny blocks generate cycles, so we need to simplify the graph
• Double-strandedness: conserved regions may occur on both strands. Example:
    5' AACCGGTT 3'
    3' TTGGCCAA 5'
  Such blocks are reverse complementary to each other ⇒ no non-branching paths
• Spurious similarity
• Memory efficiency

Colored graph

• We use colored de Bruijn graphs [Iqbal et al., 2012] to handle double-strandedness
• Suppose that $S^+$ and $S^-$ are the positive and negative strands of the chromosome
• The colored de Bruijn graph is a multigraph $G_k = (V, E)$, where $V = \Sigma^{k-1}$
• For each k-mer $T^+$ in $S^+$, add the edge $(T^+[1, k-1], T^+[2, k])$ to $G_k$ and mark it blue
• For each k-mer $T^-$ in $S^-$, add the edge $(T^-[1, k-1], T^-[2, k])$ to $G_k$ and mark it red
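
A minimal Python sketch of this coloring step follows. It assumes that the negative strand is obtained as the reverse complement of the positive strand, read 5' to 3'; the names `reverse_complement` and `colored_edges` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Complement table for the nucleotide alphabet {A, C, G, T}.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# (source (k-1)-mer, target (k-1)-mer, color)
ColoredEdge = Tuple[str, str, str]


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, read in the 5' to 3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def colored_edges(s_plus: str, k: int) -> List[ColoredEdge]:
    """List the edges of the colored de Bruijn graph of one chromosome.

    k-mers of the positive strand give blue edges; k-mers of the
    negative strand (assumed to be the reverse complement) give red edges.
    """
    s_minus = reverse_complement(s_plus)
    edges: List[ColoredEdge] = []
    for color, strand in (("blue", s_plus), ("red", s_minus)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            edges.append((kmer[:-1], kmer[1:], color))
    return edges
```

With a strand of length n, the sketch produces n − k + 1 blue edges followed by n − k + 1 red edges.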

Edge labeling

• Note that our graph is built from a string, not from a set of reads
• Each walk in the graph represents a string
• We are interested only in walks that represent substrings of the source string
• Assign to each edge $e$ the label $L(e)$ = position of the corresponding k-mer on the positive strand
• A walk $W = (v_1 e_1 v_2 e_2 \dots)$ is considered valid iff:
  1. $e_i$ and $e_{i+1}$ are of the same color
  2. $|L(e_i) - L(e_{i+1})| = 1$
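
The validity rule can be expressed as a short check over consecutive edges. The sketch below is illustrative only: the `Edge` record and the function `is_valid_walk` are hypothetical names, and the check additionally verifies that consecutive edges are adjacent in the graph, which the slide takes for granted when it speaks of a walk.

```python
from typing import NamedTuple, Sequence


class Edge(NamedTuple):
    src: str     # (k-1)-mer the edge leaves
    dst: str     # (k-1)-mer the edge enters
    color: str   # "blue" (positive strand) or "red" (negative strand)
    label: int   # L(e): position of the k-mer on the positive strand


def is_valid_walk(walk: Sequence[Edge]) -> bool:
    """Check the two validity conditions for every pair of consecutive edges."""
    for e, f in zip(walk, walk[1:]):
        if e.dst != f.src:                   # not a walk at all
            return False
        if e.color != f.color:               # condition 1: same color
            return False
        if abs(e.label - f.label) != 1:      # condition 2: adjacent positions
            return False
    return True
```

For the strand ACCTGTCAGT with k = 3, the blue edges for ACC, CCT and CTG (labels 0, 1, 2) form a valid walk, whereas following a red edge with a blue one is rejected even if the two edges share a vertex.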

Example

5' ACCTGTCAGT 3'
3' TGGACAGTCA 5'

Figure 1: Colored de Bruijn graph built from two strands (vertices are the 2-mers ac, cc, ct, tg, gt, tc, ca, ag, ga, gg; edges carry position labels 0 to 7)

Graph simplification

• Bulges spoil long non-branching paths and indicate indels/mismatches
• A pair of walks $(W_1, W_2)$ is a bulge iff:
  1) The start and end vertices of $W_1$ and $W_2$ coincide
  2) $W_1$ and $W_2$ have exactly 2 common vertices
  3) There are no edges $u \in W_1$ and $v \in W_2$ such that $L(u) = L(v)$
  4) $|W_1| \le \delta$ and $|W_2| \le \delta$

Figure 2: A bulge (two walks between vertices U and V)
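
The four conditions translate directly into a predicate on a pair of walks. The following is a sketch under stated assumptions: walks are non-empty lists of edges, $|W|$ is taken to be the number of edges in the walk, and the names `Edge`, `walk_vertices` and `is_bulge` are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Edge(NamedTuple):
    src: str    # vertex the edge leaves
    dst: str    # vertex the edge enters
    label: int  # L(e): position of the k-mer on the positive strand


def walk_vertices(walk: Sequence[Edge]) -> List[str]:
    """Vertices visited by a non-empty walk, in order."""
    return [walk[0].src] + [e.dst for e in walk]


def is_bulge(w1: Sequence[Edge], w2: Sequence[Edge], delta: int) -> bool:
    """Check conditions 1) to 4) of the bulge definition."""
    v1, v2 = walk_vertices(w1), walk_vertices(w2)
    same_ends = v1[0] == v2[0] and v1[-1] == v2[-1]            # 1)
    two_common = len(set(v1) & set(v2)) == 2                   # 2)
    no_shared_label = not ({e.label for e in w1} &
                           {e.label for e in w2})              # 3)
    short_enough = len(w1) <= delta and len(w2) <= delta       # 4)
    return same_ends and two_common and no_shared_label and short_enough
```

Condition 3) keeps a walk from being paired with a walk that passes through the same genome position, so a bulge always relates two different occurrences of similar sequence.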

General pipeline

• Build the de Bruijn graph from the genome
• Remove bulges (BFS-like algorithm)
• Bulges are removed by replacing long branches with shorter ones
• Output non-branching paths

Figure 3: Bulge removal illustration
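
The last step of the pipeline, reporting non-branching paths, can be sketched as follows. This is a generic textbook procedure rather than the authors' code: the function name `non_branching_paths` is hypothetical, edges are plain (source, target) pairs without colors or labels, and isolated cycles in which every vertex has one incoming and one outgoing edge are not reported.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def non_branching_paths(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Return maximal non-branching paths as lists of vertices."""
    out_edges: Dict[str, List[str]] = defaultdict(list)
    in_degree: Dict[str, int] = defaultdict(int)
    for src, dst in edges:
        out_edges[src].append(dst)
        in_degree[dst] += 1

    def passes_through(v: str) -> bool:
        # A vertex with exactly one way in and one way out cannot
        # start or end a maximal non-branching path.
        return in_degree[v] == 1 and len(out_edges[v]) == 1

    paths: List[List[str]] = []
    # Snapshot the keys: looking up sinks below adds entries to out_edges.
    for v in list(out_edges):
        if passes_through(v):
            continue
        for w in out_edges[v]:
            path = [v, w]
            while passes_through(w):
                w = out_edges[w][0]
                path.append(w)
            paths.append(path)
    return paths
```

On the edges A→B, B→C, C→D, C→E the sketch returns the paths A B C, C D and C E: the branch at C ends the first path.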

Parameters selection

• How should we choose K and δ?
• Duplicated genes may share no long (K > 50) K-mers
• Large K ∼ 50: we find only a few synteny blocks
• Small K ∼ 10 and small δ ∼ 15: we find very short synteny blocks
• Small K ∼ 10 and large δ ∼ 200: the genome will be disrupted completely
• Solution: do the simplification in multiple stages

New pipeline

• General idea: "align" similar regions first, then glue them together into synteny blocks
• Start with small K and small δ to smooth duplicated regions and obtain long K-mers
• Rebuild and simplify the graph with higher K and δ
• Repeat this process several times
• The final step can be done with K ∼ several hundred
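
The staged process can be pictured as a simple driver loop. The sketch below is only an outline under assumptions not stated on the slide: it supposes that each stage takes the current sequence, builds and simplifies the graph for one (K, δ) pair, and returns a smoothed sequence for the next stage. The callable `simplify_stage` is a hypothetical placeholder for that step, and `run_stages` is an illustrative name.

```python
from typing import Callable, Iterable, Tuple

Stage = Tuple[int, int]  # (K, delta) for one simplification stage


def run_stages(sequence: str,
               stages: Iterable[Stage],
               simplify_stage: Callable[[str, int, int], str]) -> str:
    """Apply the simplification stages in order, feeding each result forward.

    simplify_stage(sequence, k, delta) is a hypothetical callback that
    builds the graph with the given K, removes bulges bounded by delta,
    and returns the smoothed sequence.
    """
    previous_k, previous_delta = 0, 0
    for k, delta in stages:
        # Later stages are meant to use higher K and delta than earlier ones.
        if k <= previous_k or delta <= previous_delta:
            raise ValueError("each stage should use a higher K and delta")
        sequence = simplify_stage(sequence, k, delta)
        previous_k, previous_delta = k, delta
    return sequence
```

Passing the simplification step in as a parameter keeps the loop independent of how bulges are actually detected and removed.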

Experiment

• We attempted to identify duplications in Arabidopsis thaliana
• Arabidopsis is known to have a highly duplicated genome [Arabidopsis Genome Initiative]
• The size of the genome is ∼ 120 Mbp
• We used 4 stages with the following parameters:

  Stage number    K      δ
  1               15     150
  2               50     500
  3               100    1000
  4               500    5000

Computation results

• We found 4722 synteny blocks in Arabidopsis
• These blocks cover 28 % of the genome
• The minimum block length is 1000 bp
• The largest block found has length ∼ 95 000 bp
• We tried to verify the blocks by aligning instances of the same block
• At least 87 % of the blocks have 50 % exact matches

Computation results

Figure 4: Percentage of matches vs. number of blocks

Figure 5: Synteny block length distribution

Summary

• We have covered 28 % of the Arabidopsis genome with synteny blocks
• But we have missed some duplicated regions described in [Arabidopsis Genome Initiative]
• Most of the blocks are short (< 5000 bp)
• We must improve coverage and "lengthen" the blocks

Near-term plans:

• Improve performance
• Examine other genomes
• Optimize algorithms to handle larger genomes

Applications

• A multiple sequence alignment program
• Finding synteny blocks for complicated genomes (mammalian); possible collaboration: Jian Ma
• A tool for genome vs. genome, assembly vs. assembly, and assembly vs. genome comparisons

References

1. Pevzner P and Tesler G (2003). Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution.
2. Pham S and Pevzner P (2010). DRIMM-Synteny: Decomposing Genomes into Evolutionarily Conserved Segments.
3. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G (2012). De novo assembly and genotyping of variants using colored de Bruijn graphs.
4. Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Thank you!
