43
The Science of Information: From Communication to DNA Sequencing David Tse U.C. Berkeley CUHK December 14, 2012 Research supported by NSF Center for Science of Information.

The Science of Information: From Communication to DNA Sequencing

  • Upload
    august

  • View
    24

  • Download
    0

Embed Size (px)

DESCRIPTION

The Science of Information: From Communication to DNA Sequencing. David Tse U.C. Berkeley CUHK December 14, 2012 Research supported by NSF Center for Science of Information. TexPoint fonts used in EMF: A A A A A A A A A A A A A A A A. Communication: the beginning. - PowerPoint PPT Presentation

Citation preview

Page 1: The Science of Information: From Communication to DNA Sequencing

The Science of Information:From Communication to DNA Sequencing

David TseU.C. Berkeley

CUHKDecember 14, 2012

Research supported by NSF Center for Science of Information.

Page 2: The Science of Information: From Communication to DNA Sequencing

Communication: the beginning

• Prehistoric: smoke signals, drums.• 1837: telegraph• 1876: telephone• 1897: radio• 1927: television

Communication design tied to the specific source and specific physical medium.

Page 3: The Science of Information: From Communication to DNA Sequencing

Grand Unification

channel capacity C bits/ secsource entropy rateH bits/ source sym

Shannon 48

Theorem:max. rate of reliable communication

= CH source sym / sec.

Model all sources and channels statistically.

A unified way of looking at all communication problems in terms of information flow.

source reconstructed source

Page 4: The Science of Information: From Communication to DNA Sequencing

60 Years Later

• All communication systems are designed based on the principles of information theory.

• A benchmark for comparing different schemes and different channels.

• Suggests totally new ways of communication (eg. MIMO, opportunistic communication).

Page 5: The Science of Information: From Communication to DNA Sequencing

Secrets of Success

• Information, then computation.

It took 60 years, but we got there.

• Simple models, then complex.

The discrete memoryless channel………… is like the Holy Roman Empire.

Page 6: The Science of Information: From Communication to DNA Sequencing

Looking Forward

Can the success of this way of thinking be broadened to other fields?

Page 7: The Science of Information: From Communication to DNA Sequencing

Information Theory of DNA Sequencing

Page 8: The Science of Information: From Communication to DNA Sequencing

DNA sequencing

A basic workhorse of modern biology and medicine.

Problem: to obtain the sequence of nucleotides.

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

courtesy: Batzoglou

Page 9: The Science of Information: From Communication to DNA Sequencing

Impetus: Human Genome Project

1990: Start

2001: Draft

2003: Finished3 billion nucleotides

courtesy: Batzoglou

3 billion $$$$

Page 10: The Science of Information: From Communication to DNA Sequencing

Sequencing gets cheaper and faster

Cost of one human genome• HGP: $ 3 billion• 2004: $30,000,000• 2008: $100,000• 2010: $10,000• 2011: $4,000 • 2012-13: $1,000• ???: $300

Time to sequence one genome: years days

Massive parallelization.

courtesy: Batzoglou

Page 11: The Science of Information: From Communication to DNA Sequencing

But many genomes to sequence

100 million species(e.g. phylogeny)

7 billion individuals (SNP, personal genomics)

1013 cells in a human(e.g. somatic mutations

such as HIV, cancer) courtesy: Batzoglou

Page 12: The Science of Information: From Communication to DNA Sequencing

Whole Genome Shotgun Sequencing

Reads are assembled to reconstruct the original DNA sequence.

Number of reads

read length L ¼ 100 - 1000 N ¼ 108

genome length G ¼ 109

Page 13: The Science of Information: From Communication to DNA Sequencing

A Gigantic Jigsaw Puzzle

Page 14: The Science of Information: From Communication to DNA Sequencing

Many Sequencing Technologies • HGP era: single technology (Sanger)

• Current: multiple “next generation” technologies (eg. Illumina, SoLiD, Pac Bio, Ion Torrent, etc.)

• Each technology has different read length, noise statistics, etc

Eg.: Illumina: L = 50 to 200, error ~ 1 % substitution

Pac Bio: L = 2000 to 4000, error ~ 10-15% indels

Page 15: The Science of Information: From Communication to DNA Sequencing

Many assembly algorithms

Source: Wikipedia

Page 16: The Science of Information: From Communication to DNA Sequencing

And many more…….

A grand total of 42!

Page 17: The Science of Information: From Communication to DNA Sequencing

Computational View

“Since it is well known that the assembly problem is NP-hard, …………”

• algorithm design based largely on heuristics• no optimality or performance guarantees

But NP-hardness does not mean it is hopeless to be close to optimal.

Can we first define optimality without regard to computational complexity?

Page 18: The Science of Information: From Communication to DNA Sequencing

Information theoretic view

• Given a statistical model, what is the read length L and number of reads N needed to reconstruct with probability 1-ε ?

• Are there computationally efficient assembly algorithms that perform close to the fundamental limits?

Open questions!

Page 19: The Science of Information: From Communication to DNA Sequencing

• Reads are uniformly sampled from the DNA sequence.

• Read process is noiseless.

Impact of noise: later.

A basic read model

Page 20: The Science of Information: From Communication to DNA Sequencing

Coverage Analysis

• Pioneered by Lander-Waterman in 1988.

• What is the number of reads needed to cover the entire DNA sequence with probability 1-²?

• Ncov only provides a lower bound on the number of reads needed for reconstruction.

• Ncov does not depend on the DNA statistics!

Page 21: The Science of Information: From Communication to DNA Sequencing

Repeat statistics do matter!

easier jigsaw puzzle harder jigsaw puzzle

How exactly do the fundamental limits depend on repeat statistics?

Page 22: The Science of Information: From Communication to DNA Sequencing

reconstructable by greedy algorithm

Simple model: I.I.D. DNA, G ! 1 (Motahari, Bresler & T. 12)

read length L

1

many repeats of length L

no repeatsof length L

normalized # of reads

coverage

no coverage

What about for finite real DNA?

Page 23: The Science of Information: From Communication to DNA Sequencing

`

log(# of -̀repeats)

i.i.d. fit data

I.I.D. DNA vs real DNA

Example: human chromosome 22 (build GRCh37, G = 35M)

(Bresler, Bresler & T. 12)

Can we derive performance bounds directly in terms of empirical repeat statistics?

Page 24: The Science of Information: From Communication to DNA Sequencing

Lower bound: Interleaved repeats

Necessary condition:all interleaved repeats are bridged.

L

m m nn

In particular: L > longest interleaved repeat length (Ukkonen)

Page 25: The Science of Information: From Communication to DNA Sequencing

Lower bound: Triple repeats

Necessary condition:

all triple repeats are bridged

In particular: L > longest triple repeat length (Ukkonen)

L

Page 26: The Science of Information: From Communication to DNA Sequencing

`

log(# of -̀repeats)

Chromosome 22 (Lower Bound)

GRCh37 Chr 22 (G = 35M)

triple repeat

interleaved repeat

coverage

what is achievable?

Page 27: The Science of Information: From Communication to DNA Sequencing

Greedy algorithm (TIGR Assembler, phrap, CAP3...)

Input: the set of N reads of length L

1. Set the initial set of contigs as the reads

2. Find two contigs with largest overlap and merge them into a new contig

3. Repeat step 2 until only one contig remains

Page 28: The Science of Information: From Communication to DNA Sequencing

Greedy algorithm: first error at overlap

A sufficient condition for reconstruction:

repeat

bridging read already merged

contigs

all repeats are bridged

L

Page 29: The Science of Information: From Communication to DNA Sequencing

`

log(# of -̀repeats)

Chromosome 22

GRCh37 Chr 22 (G = 35M)

greedyalgorithm

lower bound

Page 30: The Science of Information: From Communication to DNA Sequencing

longest interleaved repeatsat length 2248

lower bound

longest repeat at

Chromosome 19

GRCh37 Chr 19 (G = 55M)

log(# of -̀repeats)greedyalgorithm

non-interleaved repeatsare resolvable!

Page 31: The Science of Information: From Communication to DNA Sequencing

de Bruijn graph

ATAGCCCTAGCGAT

[Idury-Waterman 95][Pevzner et al 01]

(K = 4)

TAGC

AGCC

AGCG

GCCC

GCGA

CCCTCCTA

CTAG

ATAG

CGAT

1. Add a node for each K-mer in a read

2. Add edges for adjacent K-mers

Page 32: The Science of Information: From Communication to DNA Sequencing

Resolving non-interleaved repeats

non-interleaved repeat

Unique Eulerian path.

Page 33: The Science of Information: From Communication to DNA Sequencing

Resolving bridged interleaved repeats

interleaved repeat

bridging read

Bridging read resolves one repeat and the unique Eulerian path resolves the other.

Page 34: The Science of Information: From Communication to DNA Sequencing

Resolving triple repeats

triple repeat

all copies bridged

neighborhood of triple repeat

all copies bridgedresolve repeat locally

Page 35: The Science of Information: From Communication to DNA Sequencing

Multibridging De-Brujin

Theorem:

Original sequence is reconstructable if:

2. interleaved repeats are (single) bridged

3. coverage

1. triple repeats are all-bridged

Necessary conditions for ANY algorithm:

1. triple repeats are (single) bridged

2. interleaved repeats are (single) bridged.

3. coverage.

(Bresler, Bresler & T. 12)

Page 36: The Science of Information: From Communication to DNA Sequencing

longest interleaved repeatsat length 2248

lower bound

longest repeat at

Chromosome 19

GRCh37 Chr 19 (G = 55M)

log(# of -̀repeats)

De-brujin algorithmclose to optimal

triple repeat

Page 37: The Science of Information: From Communication to DNA Sequencing

GAGE Benchmark Datasets

Staphylococcus aureus

i.i.d. fit data

Rhodobacter sphaeroidesG = 4,603,060 G = 2,903,081 G =88,289,540

Human Chromosome14

http://gage.cbcb.umd.edu/

Page 38: The Science of Information: From Communication to DNA Sequencing

Gap

Sulfolobus islandicus. G = 2,655,198

triple repeat lower bound

interleaved repeatlower bound

De-Brujinalgorithm

Page 39: The Science of Information: From Communication to DNA Sequencing

Read Noise

ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTTATACTTA

Illumina noise profile

Each symbol corrupted by a noisy channel.

Page 40: The Science of Information: From Communication to DNA Sequencing

Erasures on i.i.d. uniform DNA

Theorem:

If the erasure probability is less than 1/3, then noiseless performance can be achieved.

A separation architecture is optimal:

(Ma, Motahari, Ramchandran & T. 12)

errorcorrection

assembly

Page 41: The Science of Information: From Communication to DNA Sequencing

Why?

• Coverage means most positions are covered by many reads.

• Aligning noisy reads locally is easier than assembling noiseless reads globally for perasure < 1/3.

noise averaging

Page 42: The Science of Information: From Communication to DNA Sequencing

Conclusions

• A systematic approach to assembly design based on information.

• More powerful than just computational complexity considerations.

• Simple models are useful for initial insights but a data-driven approach yields a more complete picture.

Page 43: The Science of Information: From Communication to DNA Sequencing

Collaborators

Acknowledgments

Abolfazl Motahari

Guy Bresler

Ma’ayan Bresler

Nan Ma

Kannan Ramchandran

Yun Song Lior Pachter Serafim Batzoglou