37
The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August 31, 2012 Research supported by NSF Center for Science of Information. Joint work with A. Motahari, G. Bresler, M. Bresler. Acknowledgements: S. Batzolgou, L. Pachter, Y. Song

The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

The Science of Information:

From Communication to DNA Sequencing

David Tse

Dept. of EECS

U.C. Berkeley

Telecom Paris

August 31, 2012

Research supported by NSF Center for Science of Information.

Joint work with A. Motahari, G. Bresler, M. Bresler.

Acknowledgements: S. Batzolgou, L. Pachter, Y. Song

Page 2: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Communication: the beginning

• Prehistoric: smoke signals, drums.

• 1837: telegraph

• 1876: telephone

• 1897: radio

• 1927: television

Communication design tied to the specific source and

specific physical medium.

Page 3: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Grand Unification

channel capacity C bits/ sec

source entropy rate H bits/ source sym

Shannon 48

Theorem:

max. rate of reliable communicat ion

=C

Hsource sym / sec.

Model all sources and channels statistically.

A unified way of looking at all communication problems in terms of

information flow.

source

reconstructed

source

Page 4: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

60 Years Later

• All communication systems are designed based on

the principles of information theory.

• A benchmark for comparing different schemes and

different channels.

• Suggests totally new ways of communication.

Page 5: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Secrets of Success

• Information , then computation.

It took 60 years, but we got there.

• Simple models, then complex.

The discrete memoryless channel

………… is like the Holy Roman Empire.

• Infinity, and then back.

Allow us to think in terms of typical behavior.

Page 6: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Tom Cover, 1990

“Asymptotic limit is the first term in the Taylor series expansion

at infinity.

And theory is the first term in the Taylor series of practice.”

Page 7: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Looking Forward

Can the success of this way of thinking be broadened

to other fields?

Page 8: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Information Theory of DNA Sequencing

Page 9: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

DNA sequencing

DNA: the blueprint of life

Problem: to obtain the sequence of nucleotides.

…ACGTGACTGAGGACCGTG

CGACTGAGACTGACTGGGT

CTAGCTAGACTACGTTTTA

TATATATATACGTCGTCGT

ACTGATGACTAGATTACAG

ACTGATTTAGATACCTGAC

TGATTTTAAAAAAATATT…

courtesy: Batzoglou

Page 10: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Impetus: Human Genome Project

1990: Start

2001: Draft

2003: Finished 3 billion nucleotides

courtesy: Batzoglou

3 billion $$$$

Page 11: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Sequencing gets cheaper and faster

Cost of one human genome

• HGP: $ 3 billion

• 2004: $30,000,000

• 2008: $100,000

• 2010: $10,000

• 2011: $4,000

• 2012-13: $1,000

• ???: $300

Time to sequence one genome: years days

Massive parallelization.

courtesy: Batzoglou

Page 12: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

But many genomes to sequence

100 million species

(e.g. phylogeny)

7 billion individuals

(SNP, personal genomics)

1013 cells in a human

(e.g. somatic mutations

such as HIV, cancer) courtesy: Batzoglou

Page 13: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Whole Genome Shotgun Sequencing

Reads are assembled to reconstruct the original DNA sequence.

Page 14: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

A Gigantic Jigsaw Puzzle

Page 15: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Many Sequencing Technologies

• HGP era: single technology

(Sanger)

• Current: multiple “next generation”

technologies (eg. Illumina, SoLiD,

Pac Bio, Ion Torrent, etc.)

• Each technology has different

read lengths, noise profiles, etc

Page 16: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Many assembly algorithms

Source:

Wikipedia

Page 17: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

And many more…….

A grand total of 42!

Why is there no single optimal algorithm?

Page 18: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

A basic question

• What is the minimum number of reads required for

reliable reconstruction?

• How much intrinsic information does each read

provide about the DNA sequence?

• A benchmark for comparing different algorithms and

different technologies.

• An open question!

Page 19: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Coverage Analysis

• Pioneered by Lander-Waterman

• What is the minimum number of reads to ensure there is

no gap between the reads with a desired prob.?

• Only provides a lower bound on the minimum number of

reads to reconstruct.

• Clearly not tight.

Page 20: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Communication and Sequencing:

An Analogy

Communication:

Sequencing:

Question: what is the max. sequencing rate such that

reliable reconstruction is asymptotically possible?

source

sequence

max. communicat ion rate R =Cchannel

Hsource

source sym / sec.

sequencing rate R =G

NDNA sym / read

Page 21: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

A Basic Model

• DNA sequence: i.i.d. with marginal distribution

p=(p1, p2, p3, p4).

• Starting positions of reads: i.i.d. uniform on the DNA

sequence.

• Read process: noiseless.

Will build on this to look at statistics from genomic data.

Page 22: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

The read channel

• Capacity depends on

– read length: L

– DNA length: G

• Normalized read length:

• Eg. L = 100, G = 3 £ 109 :

read

channel AGCTTATAGGTCCGCATTACC AGGTCC

¹L :=L

logG

¹L = 4:6

L

G

Page 23: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Result: Sequencing Capacity

Renyi entropy of order 2

C = 0

C = ¹L

The higher the entropy,

the easier the problem!

Page 24: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Complexity is in the eye of the beholder

Low entropy High entropy

harder to compress

easier jigsaw puzzle harder jigsaw puzzle

easier to compress

Page 25: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Capacity Result Explained

C = 0

C = ¹L

coverage constraint

N > G=L logG

Page 26: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

A necessary condition for reconstruction

interleaved repeats of length

None of the copies is straddled by a read (unbridged).

Reconstruction is impossible!

Special cases:

Under i.i.d. model, greedy is optimal for mixed reads.

Page 27: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

A sufficient condition: greedy algorithm

Input: the set of N reads of length L

1. Set the initial set of contigs as the reads

2. Find two contigs with largest overlap and merge

them into a new contig

3. Repeat step 2 until only one contig remains

Algorithm progresses in stages:

at stage

contig

merge reads at overlap

overlap

Page 28: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Greedy algorithm: stage

merge mistake => there must be a -repeat

and

each copy is not bridged by a read.

A sufficient condition for reconstruction:

There is no unbridged - repeat for any .

repeat

bridging read already merged

` `

contigs

Page 29: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Summary

Necessary condition for reconstruction:

No unbridged interleaved -repeats for any .

Sufficient condition (via greedy algorithm)

No unbridged -repeats for any .

For the i.i.d. DNA model:

1) If there are no unbridged interleaved repeats, then

w.h.p. there are no unbridged repeats.

2) The probability is dominated when either

Page 30: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Capacity Result Explained

C = 0

C = ¹L

interleaved (L-1)- repeats

L-1 L-1 L-1 L-1

coverage constraint

Page 31: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Summary: Two Regimes

coverage-limited

regime repeat-limited

regime

Question:

Is this clean state of affairs tied to the i.i.d. DNA model?

Page 32: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

I.I.D. DNA vs real DNA

• Mammalian DNA has many long repeats.

• How will the greedy algorithm perform for general

DNA statistics?

• Will there be a clean decomposition into two

regimes?

Page 33: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Greedy algorithm: general DNA statistics

• Reconstruction if there are no unbridged repeats.

• Performance depends on the DNA statistics through the

number of - repeats:

• Necessary condition translates similarly to an upper

bound on capacity:

`

? ?

Page 34: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

2 4 6 8 10 12

11.5

12

12.5

13

13.5

14

14.5

15

15.5

16

16.5

2 4 6 8 10 12 14

10

11

12

13

14

15

16

20 40 60 80 100 120 140 160

0

2

4

6

8

10

12

14

16

0 500 1000 1500 20000

5

10

15

`

Rgreedy (L)log(# of `-repeats)

i.i.d. data

0 500 1000 1500 2000 2500 3000 3500 4000 45000

50

100

150

200

250

0 500 1000 1500 2000 2500 3000 3500 4000 45000

50

100

150

200

250

upper bound

2200 2250 2300 2350 2400 24500

20

40

60

80

100

120

140

longest repeat

at

upper bound

I.I.D. DNA vs real DNA

GRCh37 Chr 22 (G = 35M)

longest interleaved repeats

at 2325

Page 35: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

0 1000 2000 3000 4000 5000 60000

50

100

150

200

250

300

350

Rgreedy (L)

0 1000 2000 3000 40000

5

10

15

longest interleaved repeats

at length 2248

upper bound

longest repeat

at

Chromosome 19

GRCh37 Chr 19 (G = 55M)

log(# of `-repeats)

There is another more sophisticated algorithm that would close the

gap and in fact near optimal on all 22 chromosomes.

Page 36: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Ongoing work

• Noisy reads

• Reference-based assembly.

• Partial reconstruction.

……...

Page 37: The Science of Information: From Communication to DNA ... · The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley Telecom Paris August

Conclusion

• Information theory has made a huge impact on

communication.

• Its success stems from focusing on something

fundamental: information.

• This philosophy may be useful for other important

engineering problems.

• DNA sequencing is a good example.