Comparative Genomics and de Bruijn graphs

Page 1: Comparative Genomics and de Bruijn graphs

Comparative Genomics and the de Bruijn graphs

Ilia Minkin

Pennsylvania State University

16th September 2016

1 / 43

Page 2: Comparative Genomics and de Bruijn graphs

What is comparative genomics?

"The collection of all research activities that derive biological insights by comparing genomic features."¹

Why do it?

- Learn evolution

- Learn function

¹ "Comparative Genomics", Xuhua Xia

2 / 43

Page 4: Comparative Genomics and de Bruijn graphs

Learn Evolution

3 / 43

Page 5: Comparative Genomics and de Bruijn graphs

Learn Function

A genomic sequence itself does not show its functions

How to find function?

- Compare with sequences of known function

- Conserved sequences are likely to be important

How to compare genomes?

4 / 43

Page 6: Comparative Genomics and de Bruijn graphs

What is an Alignment?

Organisms inherit genomes, but with "errors":

The Ancestor

Genome A Genome B

Which characters did A and B inherit from their ancestor?

5 / 43

Page 7: Comparative Genomics and de Bruijn graphs

What is an Alignment?

Alignments are written down as a table:

ACTG-TGA
ACTACTGA

Blue letters are matches; yellow are mismatches; dashes are indels.

This is a global alignment.

6 / 43

Page 8: Comparative Genomics and de Bruijn graphs

The Global Alignment

ACTG-TGA
ACTACTGA

For two strings A and B:

- Place them under each other

- Insert dashes into A and B so that |A| = |B|

- Penalize dashes and mismatches

- Which alignment gives the least penalty?

- Complexity: O(|A||B|)
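
The least-penalty question above is usually answered with a dynamic programming table, which is where the O(|A||B|) bound comes from. The sketch below is one minimal way to write it in Python. The function name `global_alignment` and the unit penalties (`mismatch=1`, `gap=1`) are assumptions made for illustration; the slides do not specify a scoring scheme.

```python
def global_alignment(a, b, mismatch=1, gap=1):
    """Return (penalty, row_a, row_b) for one minimum-penalty global alignment.

    The penalties are illustrative assumptions, not values from the slides.
    """
    n, m = len(a), len(b)
    # cost[i][j] = least penalty for aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(diag, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover the two rows.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


penalty, row_a, row_b = global_alignment("ACTGTGA", "ACTACTGA")
print(penalty)
print(row_a)
print(row_b)
```

With these unit penalties the two strings from the slide need one mismatch and one dash, so the penalty is 2. Several alignments can tie for the minimum, so the rows printed need not match the slide exactly.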

7 / 43

Page 9: Comparative Genomics and de Bruijn graphs

The Local Alignment

For large sequences the global alignment does not work:

GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA

- Apart from indels and mismatches there could be rearrangements

- Rearrangements change the order of whole blocks

- Similar subsequences can be interleaved with something else

8 / 43

Page 11: Comparative Genomics and de Bruijn graphs

The Local Alignment

GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA

The problem: for A and B, find their most similar subsequences and their alignment:

ACTG-TGA
ACTACTGA

Complexity: O(|A||B|)
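
A common way to solve the local problem within the same O(|A||B|) bound is a Smith-Waterman style table, where scores are clamped at zero so an alignment can start anywhere. The sketch below assumes a scoring scheme of +1 per match and -1 per mismatch or dash; the function name `local_alignment` and those scores are illustrative choices, not taken from the slides.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, part_of_a, part_of_b) for one best local alignment.

    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score of an alignment ending at a[i-1] and b[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j

    # Walk back until the score drops to zero to find where the alignment starts.
    i, j = best_i, best_j
    while score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:best_i], b[j:best_j]


print(local_alignment("GAACTGTGATTAGGACGT", "ATTTGGGACTACTGAGTA"))
```

The function returns the two substrings that take part in the best-scoring local alignment; recovering the dashes between them works the same way as in a global traceback.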

9 / 43

Page 12: Comparative Genomics and de Bruijn graphs

An Example

We can generalize to many genomes:

GAACTGTGATTATGCTCA
ATTTGGGACTACTGAGTA
ATCTTGAGATAGCTGAAA

Alignments:

ACTG-TGAACTACTGAA-TGCTCA

10 / 43

Page 14: Comparative Genomics and de Bruijn graphs

Multiple Local Alignment

Issues:

- Some subsequences can be present in some genomes and absent in others

- Genomes can have duplications

- Multiple sequence alignment is NP-hard

→ We need some heuristics

11 / 43

Page 16: Comparative Genomics and de Bruijn graphs

Another Approach

Another way to find common subsequences is to build a graph from genomes

In such a graph homologous subsequences will collapse into non-branching paths, while unique ones will form disjoint paths

12 / 43

Page 17: Comparative Genomics and de Bruijn graphs

The Linear Representation

Two genomes:

13 / 43

Page 18: Comparative Genomics and de Bruijn graphs

Solution: a Graph Representation

What we want to see:

14 / 43

Page 19: Comparative Genomics and de Bruijn graphs

Genomes as a Railroad

15 / 43

Page 20: Comparative Genomics and de Bruijn graphs

Why de Bruijn graph?

A simple object.

Demonstrated utility in:

- Assembly

- Read mapping

- Synteny identification

16 / 43

Page 21: Comparative Genomics and de Bruijn graphs

The de Bruijn Graph

k = 2

TGACGTC TGACTTC

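[Figure: each string drawn as a path over its 2-mers: TGACGTC as TG → GA → AC → CG → GT → TC, and TGACTTC as TG → GA → AC → CT → TT → TC]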

17 / 43

Page 24: Comparative Genomics and de Bruijn graphs

The de Bruijn Graph

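[Figure: the two paths TG → GA → AC → CG → GT → TC and TG → GA → AC → CT → TT → TC, and below them the graph obtained by gluing identical 2-mers, with vertices TG, GA, AC, CG, GT, CT, TT, TC]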

18 / 43

Page 26: Comparative Genomics and de Bruijn graphs

The de Bruijn graph

In the de Bruijn graph identical substrings of length at least k + 1 are collapsed into non-branching paths
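
To make the collapsing concrete, here is a minimal Python sketch that builds the graph for the two example strings with k = 2. Vertices are k-mers and each (k + 1)-mer of an input string contributes one edge; the function name `de_bruijn_graph` and the dictionary-of-successors representation are assumptions made for illustration.

```python
def de_bruijn_graph(strings, k):
    """Map each k-mer (vertex) to the sorted list of k-mers it has an edge to.

    Every (k+1)-mer of an input string yields an edge from its first k
    characters to its last k characters; identical edges are stored once.
    """
    successors = {}
    for s in strings:
        for i in range(len(s) - k):
            u, v = s[i:i + k], s[i + 1:i + k + 1]
            successors.setdefault(u, set()).add(v)
            successors.setdefault(v, set())  # keep vertices with no out-edges
    return {u: sorted(vs) for u, vs in successors.items()}


graph = de_bruijn_graph(["TGACGTC", "TGACTTC"], k=2)
for vertex in sorted(graph):
    print(vertex, "->", graph[vertex])
```

Both strings start with TGAC, so TG and GA each keep a single outgoing edge even though two strings pass through them. The paths split at AC, which has two successors, and meet again at TC.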

We can use this to find homologous blocks.

We developed a tool, "Sibelia", that finds such blocks in many bacterial genomes and handles repeats.

But we can do more.

19 / 43

Page 27: Comparative Genomics and de Bruijn graphs

Alignment to a Graph

It is common to have an unassembled genome

Reads are then aligned to a very similar reference genome:

20 / 43

Page 28: Comparative Genomics and de Bruijn graphs

Alignment to a Graph

Issues:

- More than one reference?

- Repeats within genomes?

Solution: align reads to a graph!

21 / 43

Page 30: Comparative Genomics and de Bruijn graphs

Alignment to a Graph

In the future, genome graphs will encode information about a population

Aligning reads to a graph has many advantages:

- Efficient alignment to many genomes

- Reusing information about variants

- Handling of repeats

The de Bruijn graph is a feasible model for a graph reference.

Issue: the graph can be too large.

22 / 43

Page 34: Comparative Genomics and de Bruijn graphs

Compaction

After compaction:

[Figure: compacted graph with vertices TG, AC, TC and edges labeled TGAC, ACGTC, ACTTC]

23 / 43

Page 36: Comparative Genomics and de Bruijn graphs

The Challenge

Construct the compacted graph from many large genomes, bypassing the ordinary graph traversal.

Earlier work, based on suffix arrays/trees: Sibelia & SplitMEM handled > 60 E. coli genomes.

A recent advance: 7 human genomes in 15 hours using 100 GB of RAM with a BWT-based algorithm (Baier et al., 2015; Beller et al., 2014).

24 / 43

Page 39: Comparative Genomics and de Bruijn graphs

Junctions

A vertex v is a junction if:

- v has ≥ 2 distinct outgoing or incoming edges

- v is the first or the last k-mer of an input string

Facts:

- Junctions = vertices of the compacted graph

- Compaction = finding positions of junctions

25 / 43

Page 42: Comparative Genomics and de Bruijn graphs

Observations

[Figure: compacted graph with vertices TG, AC, TC and edges labeled TGAC, ACGTC, ACTTC]

TG GA AC CG GT TC

TG → AC → TC

26 / 43

Page 45: Comparative Genomics and de Bruijn graphs

The Observation

The observation only works when we have complete genomes.

Once we know junctions, construction of the edges is simple.

We can simply traverse input strings and record junctions in the order they appear.
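
The traversal just described can be sketched in a few lines of Python. The sketch assumes the set of junction k-mers is already known and is passed in; the function name `compacted_edges` and the tuple layout `(from, to, label)` are hypothetical choices for illustration.

```python
def compacted_edges(strings, k, junctions):
    """List compacted-graph edges as (from_junction, to_junction, label).

    Each input string is scanned once; every pair of consecutive junction
    occurrences yields one edge, labeled by the substring spanning both.
    """
    edges = set()
    for s in strings:
        last = None  # start position of the most recent junction k-mer
        for i in range(len(s) - k + 1):
            if s[i:i + k] in junctions:
                if last is not None:
                    edges.add((s[last:last + k], s[i:i + k], s[last:i + k]))
                last = i
    return sorted(edges)


for edge in compacted_edges(["TGACGTC", "TGACTTC"], 2, {"TG", "AC", "TC"}):
    print(edge)
```

For the running example with junctions TG, AC and TC, this yields the three edges TGAC, ACGTC and ACTTC. The shared edge TGAC is recorded once because the edges are kept in a set.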

How to identify junctions?

27 / 43

Page 46: Comparative Genomics and de Bruijn graphs

The Naive Algorithm

A naive way:

- Store all (k + 1)-mers (edges) in a hash table

- Consider each vertex one by one

- Query all possible edges from the table

- If found > 1 edge, mark vertex as a junction

Problem: the hash table can be too large.
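
A minimal Python sketch of the naive approach follows, with a built-in `set` standing in for the hash table. The function name `find_junctions_naive` and the `alphabet` parameter are assumptions; the first and last k-mers of each string are added explicitly to follow the junction definition given earlier in the talk.

```python
def find_junctions_naive(strings, k, alphabet="ACGT"):
    """Return the set of junction k-mers using a table of all (k+1)-mers."""
    # Step 1: store every (k+1)-mer (edge) in a hash table.
    edges = set()
    for s in strings:
        for i in range(len(s) - k):
            edges.add(s[i:i + k + 1])

    # Step 2: for each vertex, query all possible edges around it.
    junctions = set()
    for s in strings:
        if len(s) < k:
            continue
        junctions.add(s[:k])   # first k-mer of an input string
        junctions.add(s[-k:])  # last k-mer of an input string
        for i in range(len(s) - k + 1):
            v = s[i:i + k]
            out_degree = sum((v + c) in edges for c in alphabet)
            in_degree = sum((c + v) in edges for c in alphabet)
            if out_degree > 1 or in_degree > 1:
                junctions.add(v)
    return junctions


print(sorted(find_junctions_naive(["TGACGTC", "TGACTTC"], k=2)))
```

The table holds one entry per distinct (k + 1)-mer of the input, which is what makes this approach memory-hungry on large genomes.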

28 / 43

Page 48: Comparative Genomics and de Bruijn graphs

An Example

Hash table = { GA → AC }

[Figure: vertex GA with its four possible successors AA, AG, AC, AT]

29 / 43

Page 49: Comparative Genomics and de Bruijn graphs

What is the Bloom filter?

A probabilistic data structure representing a set

Properties:

- Occupies fixed space

- May generate false positives on queries

- False positive rate is low

Example: Bloom Filter = { GA → AC }

Is GA → AC in the set? Yes.

Is GA → AT in the set? Probably no (a false positive is possible).
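
The properties listed above can be seen in a small Python sketch of a Bloom filter. The class name `BloomFilter`, the default sizes, and the use of seeded SHA-256 digests as hash functions are all assumptions made for illustration; a production implementation would likely use faster hash functions.

```python
import hashlib


class BloomFilter:
    """A fixed-size bit array with several hash functions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # space never grows

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True for every added item; may also be True for others (false positive).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


edges = BloomFilter()
edges.add("GAC")       # the edge GA -> AC, stored as its (k+1)-mer
print("GAC" in edges)  # always True: there are no false negatives
print("GAT" in edges)  # usually False, but a false positive is possible
```

An added item is always reported as present, so a negative answer is definite. A positive answer for an item that was never added is a false positive, and its probability drops as the bit array grows relative to the number of stored items.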

30 / 43

Page 52: Comparative Genomics and de Bruijn graphs

An Example

Bloom Filter = { GA → AC, GA → AT }

[Figure: vertex GA with its four possible successors AA, AG, AC, AT]

The purple edge is a false positive.

31 / 43

Page 53: Comparative Genomics and de Bruijn graphs

The Two Pass Algorithm

How to eliminate false positives?

Two-pass algorithm:

1. Use the Bloom filter to identify junction candidates

2. Use the hash table, but store only edges that touch candidates
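
The two passes can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual implementation behind the results reported later: the function names, the one-byte-per-bit array, and the SHA-256 based hashing are all hypothetical choices.

```python
import hashlib


def _bit_positions(item, num_bits, num_hashes):
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % num_bits


def _kmers(s, n):
    return (s[i:i + n] for i in range(len(s) - n + 1))


def _branches(v, has_edge, alphabet):
    """True if v has more than one incoming or more than one outgoing edge."""
    out_degree = sum(has_edge(v + c) for c in alphabet)
    in_degree = sum(has_edge(c + v) for c in alphabet)
    return out_degree > 1 or in_degree > 1


def two_pass_junctions(strings, k, num_bits=1 << 16, num_hashes=3, alphabet="ACGT"):
    # Pass 1: put every edge into a Bloom filter (one byte per bit for clarity).
    bits = bytearray(num_bits)
    for s in strings:
        for edge in _kmers(s, k + 1):
            for p in _bit_positions(edge, num_bits, num_hashes):
                bits[p] = 1

    def in_filter(edge):
        return all(bits[p] for p in _bit_positions(edge, num_bits, num_hashes))

    # Candidates include every true junction, plus some false positives.
    ends = {e for s in strings if len(s) >= k for e in (s[:k], s[-k:])}
    candidates = set(ends)
    for s in strings:
        for v in _kmers(s, k):
            if _branches(v, in_filter, alphabet):
                candidates.add(v)

    # Pass 2: an exact hash table, restricted to edges touching a candidate.
    table = set()
    for s in strings:
        for edge in _kmers(s, k + 1):
            if edge[:k] in candidates or edge[1:] in candidates:
                table.add(edge)

    junctions = set(ends)
    for v in candidates:
        if _branches(v, table.__contains__, alphabet):
            junctions.add(v)
    return junctions


print(sorted(two_pass_junctions(["TGACGTC", "TGACTTC"], k=2)))
```

Because a Bloom filter has no false negatives, every true junction survives the first pass, and the exact table in the second pass removes the spurious candidates. The filter size therefore affects only how much work the second pass does, not the final answer.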

32 / 43

Page 55: Comparative Genomics and de Bruijn graphs

An Example: the First Step

Here are the edges stored in the Bloom filter; the purple ones are false positives:

[Figure: graph over the vertices TG, GA, AC, CG, GT, TC, CT, TT, AT, CC]

Junction candidates: GA & AC

33 / 43

Page 56: Comparative Genomics and de Bruijn graphs

An Example: the Second Step

Edges stored in the hash table. We kept only edges touching junction candidates:

Junction: AC

34 / 43

Page 57: Comparative Genomics and de Bruijn graphs

Results

Datasets:

- 7 humans: 5 versions of the reference + 2 haplotypes of NA12878 from 1000 Genomes

- 93 simulated humans (FIGG)

- 8 primates available in UCSC genome browser

35 / 43

Page 58: Comparative Genomics and de Bruijn graphs

Results

Running time (minutes) & memory usage (GBs).

# genomes        BWT-based       TwoPaCo         TwoPaCo
                 1 thread        1 thread        15 threads

Humans
7, k = 25        867 (100.30)    436 (4.40)      63 (4.84)
7, k = 100       807 (46.02)     317 (8.42)      57 (8.75)
43+7, k = 25     -               -               705 (69.77)
43+7, k = 100    -               -               927 (70.21)
93+7, k = 25     -               -               1383 (77.42)

Primates
8, k = 25        -               914 (34.36)     111 (34.36)
8, k = 100       -               756 (56.06)     101 (61.68)

36 / 43

Page 59: Comparative Genomics and de Bruijn graphs

Conclusion & Future Work

Advantages of the algorithm:

- Fast

- Small memory footprint

- Can handle large inputs

Drawbacks:

- Less applicable for large k

Take-home message: it is easy to construct the compacted de Bruijn graph for complete genomes.

37 / 43

Page 61: Comparative Genomics and de Bruijn graphs

Conclusion & Future Work

Can potentially facilitate:

- Visualization

- Synteny mining (Sibelia)

- Structural variations analysis

- ...

38 / 43

Page 62: Comparative Genomics and de Bruijn graphs

Acknowledgments

Personal:

- Daniel Lemire

- GFA format working group

Funding, NSF awards:

- DBI-1356529

- CCF-1439057

- IIS-1453527

- IIS-1421908

39 / 43

Page 63: Comparative Genomics and de Bruijn graphs

Thank you for your attention!

Twitter: @IliaMinkin

40 / 43

Page 64: Comparative Genomics and de Bruijn graphs

Input Size vs. Performance

41 / 43

Page 65: Comparative Genomics and de Bruijn graphs

Parallel Scalability

42 / 43

Page 66: Comparative Genomics and de Bruijn graphs

Splitting

Table 1: The minimal number of rounds it takes to compress the graph without exceeding a given memory threshold.

Memory threshold   Used memory   Bloom filter size   Running time   Rounds
10                 8.62          8.59                259            1
8                  6.73          4.29                434            3
6                  5.98          4.29                539            4
4                  3.51          2.14                665            6

43 / 43