Comparative Genomics and de Bruijn graphs

Ilia Minkin

Pennsylvania State University

16th September 2016

What is comparative genomics?

"The collection of all research activities that derive biological insights by comparing genomic features." [1]

Why do it?

- Learn evolution

- Learn function

[1] "Comparative Genomics", Xuhua Xia

Learn Evolution

Learn Function

A genomic sequence by itself does not reveal its function.

How to find function?

- Compare with sequences of known function

- Conserved sequences are likely to be important

How to compare genomes?

What is an Alignment?

Organisms inherit genomes, but with "errors":

[Figure: an ancestor genome with two descendants, Genome A and Genome B]

Which characters did A and B inherit from their ancestor?

What is an Alignment?

Alignments are written down as a table:

ACTG-TGA
ACTACTGA

Blue letters are matches; yellow are mismatches; dashes are indels.

This is a global alignment.

The Global Alignment

ACTG-TGA
ACTACTGA

For two strings A and B:

- Place them under each other

- Insert dashes into A and B so that |A| = |B|

- Penalize dashes and mismatches

- Which alignment gives the least penalty?

- Complexity: O(|A||B|)
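The steps above correspond to the classic dynamic-programming formulation of global alignment. A minimal Python sketch follows. It assumes a unit penalty for every mismatch and every dash, since the slides do not fix the penalties, and the name `global_alignment` and its parameters are illustrative only.

```python
def global_alignment(a, b, mismatch=1, gap=1):
    """Return (penalty, row_a, row_b) for a least-penalty global alignment.

    The penalties are assumptions: 1 per mismatch and 1 per dash.
    """
    n, m = len(a), len(b)
    # cost[i][j] = least penalty for aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(diag, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal table.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diag_ok = (
            i > 0 and j > 0
            and cost[i][j] == cost[i - 1][j - 1]
            + (0 if a[i - 1] == b[j - 1] else mismatch)
        )
        if diag_ok:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


penalty, row_a, row_b = global_alignment("ACTGTGA", "ACTACTGA")
print(penalty)
print(row_a)
print(row_b)
```

Under these unit costs the least penalty for ACTGTGA and ACTACTGA is 2 (one mismatch and one dash). Several alignments can tie, so the printed rows need not match the slide exactly. The two nested loops fill |A| x |B| cells, which is where the O(|A||B|) bound comes from.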

The Local Alignment

For large sequences the global alignment does not work:

GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA

- Apart from indels and mismatches, there can be rearrangements

- Rearrangements change the order of whole blocks

- Similar subsequences can be interleaved with something else

The Local Alignment


The problem: given A and B, find their most similar subsequences and the alignment between them:

ACTG-TGA
ACTACTGA

Complexity: O(|A||B|)
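Local alignment can be sketched with the same kind of table, this time maximizing a score and never letting it drop below zero (the Smith-Waterman scheme). The scores below (+1 match, -1 mismatch, -1 dash) and the name `local_alignment` are assumptions made for illustration; the slides do not specify them.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, row_a, row_b) for one best-scoring local alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score of an alignment ending at a[i-1], b[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j

    # Trace back from the best cell until the score reaches zero.
    row_a, row_b = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_alignment("XXXACGTYYY", "ZZACGTWW"))  # (4, 'ACGT', 'ACGT')
```

Clamping at zero is what lets the alignment start and end anywhere inside the two strings, while the table size keeps the cost at O(|A||B|).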

An Example

We can generalize to many genomes:

GAACTGTGATTATGCTCA
ATTTGGGACTACTGAGTA
ATCTTGAGATAGCTGAAA

Alignments:

ACTG-TGA
ACTACTGA
A-TGCTCA

Multiple Local Alignment

Issues:

- Some subsequences can be present in some genomes and absent in others

- Genomes can have duplications

- Multiple sequence alignment is NP-hard

→ We need some heuristics

Another Approach

Another way to find common subsequences is to build a graph from the genomes.

In such a graph, homologous subsequences collapse into non-branching paths, while unique ones form disjoint paths.

The Linear Representation

Two genomes:

Solution: a Graph Representation

What we want to see:

Genomes as a Railroad

Why de Bruijn graph?

A simple object.

Demonstrated utility in:

- Assembly

- Read mapping

- Synteny identification

The de Bruijn Graph

k = 2

Input strings: TGACGTC and TGACTTC

TG → GA → AC → CG → GT → TC

TG → GA → AC → CT → TT → TC

The de Bruijn Graph

[Figure: the two paths glued into a single graph on the vertices TG, GA, AC, CG, GT, CT, TT and TC; the paths share TG → GA → AC, split at AC and rejoin at TC]

The de Bruijn graph

In the de Bruijn graph, identical substrings of length at least k + 1 are collapsed into non-branching paths.
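To make the collapsing concrete, here is a minimal Python sketch that builds the graph for the k = 2 example. It assumes vertices are k-mers and edges are (k + 1)-mers, as used elsewhere in the talk; the function name `de_bruijn_graph` and the dictionary representation are illustrative choices, not the representation used by any particular tool.

```python
def de_bruijn_graph(strings, k):
    """Map every k-mer (vertex) to the set of k-mers that follow it."""
    successors = {}
    for s in strings:
        # Register every k-mer, including the last one, which has no successor here.
        for i in range(len(s) - k + 1):
            successors.setdefault(s[i:i + k], set())
        # Every (k + 1)-mer contributes one edge: its prefix -> its suffix.
        for i in range(len(s) - k):
            successors[s[i:i + k]].add(s[i + 1:i + k + 1])
    return successors


graph = de_bruijn_graph(["TGACGTC", "TGACTTC"], k=2)
for vertex in sorted(graph):
    print(vertex, "->", sorted(graph[vertex]))
```

Both strings contribute the same vertices TG, GA and AC, so the shared prefix TGAC appears once in the graph, and the only vertex with two successors is AC.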

We can use this to find homologous blocks.

We developed a tool, "Sibelia", that finds such blocks in many bacterial genomes and handles repeats.

But we can do more.

Alignment to a Graph

It is common to have an unassembled genome.

Reads are then aligned to a very similar reference genome.

Alignment to a Graph

Issues:

- More than one reference?

- Repeats within genomes?

Solution: align reads to a graph!

Alignment to a Graph

In the future, genome graphs will encode information about a population.

Aligning reads to a graph has many advantages:

- Efficient alignment to many genomes

- Reusing information about variants

- Handling of repeats

The de Bruijn graph is a feasible model for a graph reference.

Issue: the graph can be too large.

Compaction

After compaction:

[Figure: the compacted graph with vertices TG, AC and TC and edges labelled TGAC, ACGTC and ACTTC]

The Challenge

Construct the compacted graph from many large genomes while bypassing traversal of the ordinary graph.

Earlier work based on suffix arrays/trees: Sibelia and SplitMEM handled > 60 E. coli genomes.

A recent advance: 7 human genomes in 15 hours using 100 GB of RAM, with a BWT-based algorithm (Baier et al., 2015; Beller et al., 2014).

Junctions

A vertex v is a junction if:

- v has ≥ 2 distinct outgoing or incoming edges

- v is the first or the last k-mer of an input string

Facts:

- Junctions = vertices of the compacted graph

- Compaction = finding positions of junctions
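The definition can be checked directly on a small input. The sketch below reads the first condition as "at least two distinct outgoing edges, or at least two distinct incoming edges", which is an interpretation of the slide; the name `find_junctions` is hypothetical. It builds the full edge sets, so it illustrates the definition rather than a memory-efficient method.

```python
def find_junctions(strings, k):
    """Return the set of junction k-mers according to the two conditions."""
    outgoing, incoming = {}, {}
    endpoints = set()
    for s in strings:
        if len(s) < k:
            continue  # too short to contain any k-mer
        endpoints.update((s[:k], s[-k:]))  # first and last k-mer of the string
        for i in range(len(s) - k):
            u, v = s[i:i + k], s[i + 1:i + k + 1]
            outgoing.setdefault(u, set()).add(v)
            incoming.setdefault(v, set()).add(u)

    vertices = set(outgoing) | set(incoming) | endpoints
    return {
        v for v in vertices
        if v in endpoints
        or len(outgoing.get(v, ())) >= 2
        or len(incoming.get(v, ())) >= 2
    }


print(sorted(find_junctions(["TGACGTC", "TGACTTC"], k=2)))  # ['AC', 'TC', 'TG']
```

For the running example the junctions are TG (first k-mer), AC (two outgoing edges) and TC (last k-mer, and two incoming edges), which are exactly the vertices of the compacted graph.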

Observations

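[Figure: the compacted graph with vertices TG, AC and TC and edges labelled TGAC, ACGTC and ACTTC]

k-mers of TGACGTC in order: TG GA AC CG GT TC

Junctions in order of appearance: TG → AC → TC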

The Observation

The observation only works when we have complete genomes.

Once we know the junctions, constructing the edges is simple.

We can simply traverse the input strings and record junctions in the order they appear.
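The traversal just described can be sketched in a few lines of Python. The junction set is taken as given here, and the name `compacted_edges` and the choice to label each edge with the substring between two consecutive junctions are assumptions made for illustration.

```python
def compacted_edges(strings, k, junctions):
    """Walk each string and emit one edge per pair of consecutive junctions."""
    edges = set()
    for s in strings:
        start = None  # position of the most recent junction k-mer
        for i in range(len(s) - k + 1):
            if s[i:i + k] in junctions:
                if start is not None:
                    # The edge label spans from the previous junction to this one.
                    edges.add(s[start:i + k])
                start = i
    return edges


edges = compacted_edges(["TGACGTC", "TGACTTC"], k=2, junctions={"TG", "AC", "TC"})
print(sorted(edges))  # ['ACGTC', 'ACTTC', 'TGAC']
```

Each string is scanned once, so the work is linear in the input size once the junctions are known; the hard part is finding the junctions themselves.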

How to identify junctions?

The Naive Algorithm

A naive way:

- Store all (k + 1)-mers (edges) in a hash table

- Consider each vertex one by one

- Query all possible edges from the table

- If more than one edge is found, mark the vertex as a junction

Problem: the hash table can be too large.
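A minimal Python sketch of the naive approach follows, using a built-in `set` as the hash table. Two details are assumptions added for completeness: the first and last k-mers of each string are marked as junctions (following the junction definition), and "more than one edge" is checked separately for outgoing and incoming edges. The name `naive_junctions` is hypothetical.

```python
def naive_junctions(strings, k, alphabet="ACGT"):
    """Find junctions by storing every (k + 1)-mer in one hash table."""
    edges = set()       # the hash table of all edges; this is the memory bottleneck
    junctions = set()
    for s in strings:
        if len(s) >= k:
            junctions.update((s[:k], s[-k:]))  # string endpoints are junctions
        for i in range(len(s) - k):
            edges.add(s[i:i + k + 1])

    for s in strings:
        for i in range(len(s) - k + 1):
            v = s[i:i + k]
            # Query all possible edges leaving and entering v.
            outgoing = sum((v + c) in edges for c in alphabet)
            incoming = sum((c + v) in edges for c in alphabet)
            if outgoing > 1 or incoming > 1:
                junctions.add(v)
    return junctions


print(sorted(naive_junctions(["TGACGTC", "TGACTTC"], k=2)))  # ['AC', 'TC', 'TG']
```

Every vertex costs at most 2 x |alphabet| lookups, so the running time is linear in the input, but the table holds every distinct (k + 1)-mer of the input at once.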

An Example

Hash table = { GA → AC }

[Figure: vertex GA with its four possible outgoing edges, to AA, AG, AC and AT; only GA → AC is found in the table]

What is the Bloom filter?

A probabilistic data structure representing a set

Properties:

- Occupies fixed space

- May generate false positives on queries

- False positive rate is low and depends on the filter size

Example: Bloom filter = { GA → AC }

Is GA → AC in the set? Yes.

Is GA → AT in the set? Most likely no, but a false positive is possible.
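A Bloom filter of this kind can be sketched in Python as a bit array plus a few hash functions. Everything specific below is an assumption for illustration: the class name, the default sizes, and the use of salted SHA-256 from the standard library as the hash family. The edge GA → AC is stored as the (k + 1)-mer "GAC".

```python
import hashlib


class BloomFilter:
    """Fixed-size set sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # space is fixed up front

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True if every position is set: "yes" for stored items, "maybe" otherwise.
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


bloom = BloomFilter()
bloom.add("GAC")          # the edge GA -> AC
print("GAC" in bloom)     # True: stored items are always found

tiny = BloomFilter(num_bits=1, num_hashes=1)
tiny.add("GAC")
print("GAT" in tiny)      # True: a false positive, forced by the 1-bit filter
```

The one-bit filter is a deliberately extreme case: every query hits the same bit, so every query after the first insertion answers "yes". Larger filters make such collisions rare but never impossible.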

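An Example

Bloom filter = { GA → AC, GA → AT }

[Figure: vertex GA with its four possible outgoing edges, to AA, AG, AC and AT; GA → AC and GA → AT are reported as present]

The purple edge (GA → AT) is a false positive.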

The Two Pass Algorithm

How to eliminate false positives?

Two-pass algorithm:

1. Use the Bloom filter to identify junction candidates

2. Use the hash table, but store only edges that touch candidates
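The two passes can be put together in a short Python sketch. This is an illustration under stated assumptions, not the TwoPaCo implementation: the small `BloomFilter` class, the helper `branches`, the parameter `num_bits`, and the handling of string endpoints are all hypothetical choices made to keep the example self-contained.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes and hash choice are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


def branches(kmer, edge_set, alphabet):
    """True if kmer has more than one outgoing or incoming edge in edge_set."""
    outgoing = sum((kmer + c) in edge_set for c in alphabet)
    incoming = sum((c + kmer) in edge_set for c in alphabet)
    return outgoing > 1 or incoming > 1


def two_pass_junctions(strings, k, num_bits=1 << 16, alphabet="ACGT"):
    def kmers(s):
        return (s[i:i + k] for i in range(len(s) - k + 1))

    def edges(s):
        return (s[i:i + k + 1] for i in range(len(s) - k))

    # Pass 1: all edges go into the Bloom filter; false positives can only
    # add candidates, never hide a real junction.
    bloom = BloomFilter(num_bits)
    for s in strings:
        for e in edges(s):
            bloom.add(e)
    candidates = {v for s in strings for v in kmers(s) if branches(v, bloom, alphabet)}

    # Pass 2: an exact hash table, restricted to edges touching a candidate.
    exact = {e for s in strings for e in edges(s)
             if e[:k] in candidates or e[1:] in candidates}
    junctions = {v for v in candidates if branches(v, exact, alphabet)}

    # First and last k-mers of each string are junctions by definition.
    endpoints = {v for s in strings if len(s) >= k for v in (s[:k], s[-k:])}
    return junctions | endpoints


print(sorted(two_pass_junctions(["TGACGTC", "TGACTTC"], k=2)))  # ['AC', 'TC', 'TG']
```

The result does not depend on how many false positives the first pass produces. Even with `num_bits=1`, where every query answers "yes" and every vertex becomes a candidate, the second pass rejects all false candidates; the filter size only affects how much the exact table shrinks.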

An Example: the First Step

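Here are the edges stored in the Bloom filter; the purple ones are false positives:

[Figure: the example graph with vertices TG, GA, AC, CG, GT, CT, TT and TC, plus vertices AT and CC that are reached only through false-positive edges]

Junction candidates: GA & AC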

An Example: the Second Step

Edges stored in the hash table. We kept only edges touching junction candidates:

Junction: AC

Results

Datasets:

- 7 humans: 5 versions of the reference + 2 haplotypes of NA12878 from 1000 Genomes

- 93 simulated humans (FIGG)

- 8 primates available in the UCSC Genome Browser

Results

Running time (minutes) & memory usage (GBs, in parentheses).

# genomes        BWT-based       TwoPaCo         TwoPaCo
                 1 thread        1 thread        15 threads

Humans
7, k = 25        867 (100.30)    436 (4.40)      63 (4.84)
7, k = 100       807 (46.02)     317 (8.42)      57 (8.75)
43+7, k = 25     -               -               705 (69.77)
43+7, k = 100    -               -               927 (70.21)
93+7, k = 25     -               -               1383 (77.42)

Primates
8, k = 25        -               914 (34.36)     111 (34.36)
8, k = 100       -               756 (56.06)     101 (61.68)

Conclusion & Future Work

Advantages of the algorithm:

- Fast

- Small memory footprint

- Can handle large inputs

Drawbacks:

- Less applicable for large k

Take-home message: it is easy to construct the compacted de Bruijn graph for complete genomes.


Can potentially facilitate:

- Visualization

- Synteny mining (Sibelia)

- Structural variations analysis

- ...

Acknowledgments

Personal:

- Daniel Lemire

- GFA format working group

Funding, NSF awards:

- DBI-1356529

- CCF-1439057

- IIS-1453527

- IIS-1421908

Thank you for your attention!

Twitter: @IliaMinkin

Input Size vs. Performance

Parallel Scalability

Splitting

Table 1: The minimal number of rounds it takes to compress the graph without exceeding a given memory threshold.

Memory threshold   Used memory   Bloom filter size   Running time   Rounds
10                 8.62          8.59                259            1
8                  6.73          4.29                434            3
6                  5.98          4.29                539            4
4                  3.51          2.14                665            6
