Coding for DNA Storage in Live Organisms Moshe Schwartz Electrical & Computer Engineering Ben-Gurion University Israel

Coding for DNA Storage in Live Organisms - Unicamp


Coding for DNA Storage in Live Organisms

Moshe Schwartz

Electrical & Computer Engineering, Ben-Gurion University

Israel


Based on joint works with: (alphabetically)

• Jehoshua Bruck – Caltech

• Ohad Elishco – Ben-Gurion University (now MIT)

• Farzad Farnoud (Hassanzadeh) – University of Virginia

• Siddharth Jain – Caltech

• Yonatan Yehezkeally – Ben-Gurion University


Science fiction distant future dream?


No – It’s just around the corner!


DNA is a long string

Genetic information is stored in DNA, which is a string of nucleotides: Adenine, Cytosine, Guanine, and Thymine.

In E. coli bacteria, genetic information is stored in about $4 \cdot 10^6$ base pairs. In humans, genetic information is stored in over $3 \cdot 10^9$ base pairs.


Why store information in DNA?

DNA is dense!

It stores information at the molecular level.

DNA can potentially hold $250 \cdot 2^{50}$ bytes (250 petabytes) of information in 1 gram of DNA.

If we were to use 8 TB hard drives to store the same amount, we would need 32,000 hard drives, with a total weight of about 25 tons!
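The drive count can be checked with a few lines of arithmetic. This is only a sketch: it assumes binary prefixes (1 PB = 2^50 bytes, 1 TB = 2^40 bytes) and metric tons, neither of which is stated on the slide.

```python
# Assumed unit conventions (not stated on the slide): binary prefixes.
PETABYTE = 2**50
TERABYTE = 2**40

dna_bytes_per_gram = 250 * PETABYTE   # capacity quoted for 1 gram of DNA
drive_bytes = 8 * TERABYTE            # one 8 TB hard drive

drives_needed = dna_bytes_per_gram // drive_bytes

# 25 tons spread over all drives gives the implied weight of a single drive.
total_weight_kg = 25_000              # assuming metric tons
kg_per_drive = total_weight_kg / drives_needed

print(drives_needed, kg_per_drive)
```

Under these assumptions the quoted 25 tons corresponds to a little under 0.8 kg per drive.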


OK, but why in living organisms?

• Reading from DNA is destructive, hence we need several copies. Living organisms replicate and solve this problem.

• Data longevity is (potentially) better, due to replication of organisms.

• The organism’s outer shell provides extra protection.

• Labeling organisms for biological studies.

• Watermarking genetically modified organisms (GMOs).

Main disadvantage:

Mutations!

We need error-correcting codes.


Error-correcting codes – An age old story

An error-correcting code has two main components:

1. An error ball: Its size and shape depend on the kind of errors the channel induces.

2. A packing of error balls: Its density affects communication efficiency. Its structure affects ease of encoding/decoding.


What kinds of errors do we expect?

• Insertion: u w → u v w

• Deletion: u v w → u w

• Substitution: u v w → u v′ w

• Duplication: u v w → u v v w

Which is the most common? Unknown yet, but…


Repeated sequences are everywhere

More than 50% of the human genome consists of repeated sequences!¹

Repetitions were shown to be connected with diseases such as cancer, myotonic dystrophy, and Huntington’s disease, and with important phenomena such as chromosome fragility, expansion diseases, silencing genes, and rapid morphological variation.

Repetitions are common in other species as well, and are claimed to be a major evolutionary force during vertebrate evolution.¹

¹ Lander et al., Nature 2001.


Duplication processes may repeat

ACTCA

⇓

ACTACTCA

⇓

ACTATACTCA

⇓

ACTATACACTCA

It is conceivable that a substantial portion of the unique genome, the part that is not known to contain repeated sequences, also has its origins in ancient repeated sequences that are no longer recognizable due to change over time.²

² Lander et al., Nature 2001.


Duplication processes may differ

• End duplication: u v w → u v w v

• Tandem duplication: u v w → u v v w

• Palindromic duplication: u v w → u v $v^R$ w

• Interspersed duplication: u v w z → u v w v z


A formal definition

Definition

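Let Σ be a finite alphabet, $s \in \Sigma^*$ some string, and $\mathcal{T} \subseteq {\Sigma^*}^{\Sigma^*}$ a set of string-duplication rules, i.e., functions from $\Sigma^*$ to $\Sigma^*$. A string-duplication system, S, defined by the tuple $(\Sigma, s, \mathcal{T})$, is the reflexive transitive closure of $\mathcal{T}$ operating on s, namely, $S \subseteq \Sigma^*$ is the minimal set for which:

1. $s \in S$.

2. $s' \in S$ and $T \in \mathcal{T}$ imply $T(s') \in S$.

We write $S = S(\Sigma, s, \mathcal{T})$.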


End duplication - formally

Definition (End Duplication)

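$$T^{\mathrm{end}}_{i,k}(x) = \begin{cases} uvwv & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise.} \end{cases}$$

$$\mathcal{T}^{\mathrm{end}}_k = \left\{ T^{\mathrm{end}}_{i,k} \;\middle|\; i \ge 0 \right\}.$$

The end-duplication system is defined as $S^{\mathrm{end}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{end}}_k)$.

[Figure: u v w → u v w v]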


Tandem duplication - formally

Definition (Tandem Duplication)

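$$T^{\mathrm{tan}}_{i,k}(x) = \begin{cases} uvvw & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise.} \end{cases}$$

$$\mathcal{T}^{\mathrm{tan}}_k = \left\{ T^{\mathrm{tan}}_{i,k} \;\middle|\; i \ge 0 \right\}.$$

The tandem-duplication system is defined as $S^{\mathrm{tan}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{tan}}_k)$.

[Figure: u v w → u v v w]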
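The two rules can be written as small string functions. The sketch below is an illustration only: the function names are hypothetical, and the choice of offsets and lengths used to replay the ACTCA example from the earlier slide is one possible reading of that example.

```python
def tandem_dup(x: str, i: int, k: int) -> str:
    """T^tan_{i,k}: write x = u v w with |u| = i, |v| = k and return u v v w."""
    if i < 0 or k < 1 or i + k > len(x):
        return x  # the rule does not apply, so x is returned unchanged
    u, v, w = x[:i], x[i:i + k], x[i + k:]
    return u + v + v + w


def end_dup(x: str, i: int, k: int) -> str:
    """T^end_{i,k}: write x = u v w with |u| = i, |v| = k and return u v w v."""
    if i < 0 or k < 1 or i + k > len(x):
        return x
    u, v, w = x[:i], x[i:i + k], x[i + k:]
    return u + v + w + v


# Replay the chain ACTCA => ACTACTCA => ACTATACTCA => ACTATACACTCA
# as three tandem duplications (offset i, length k).
chain = ["ACTCA"]
for i, k in [(0, 3), (2, 2), (5, 2)]:
    chain.append(tandem_dup(chain[-1], i, k))
print(" => ".join(chain))
```

Note that the first step duplicates a block starting at offset 0, which is why the rule sets are read here with i ≥ 0.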


How expressive is a duplication system?

Definition

The capacity of a string system $S \subseteq \Sigma^*$ is defined by

$$\mathrm{cap}(S) = \limsup_{n \to \infty} \frac{\log_2 |S \cap \Sigma^n|}{n}.$$

Definition

Let $S \subseteq \Sigma^*$ be a string system. We shall say S is fully expressive if for every $v \in \Sigma^*$ there exist $u, w \in \Sigma^*$ such that $uvw \in S$.


We are interested in:

• How does the capacity depend on the choice of duplication rules?

• How does the capacity depend on the choice of seed string?

• Which systems are fully expressive?

• What is the connection between capacity and full expressiveness?


Some related previous work exists

Tandem duplication was studied in the context of formal languages:

• Martín-Vide and Paun, Acta Cybernetica (1999): Where are tandem-duplication languages located in the Chomsky hierarchy?

• Dassow, Mitrana and Paun, Bull. of the EATCS (1999): Binary tandem-duplication languages are regular.

• Ming-Wei, Bull. of the EATCS (2000): Non-binary tandem-duplication languages are irregular.


More related previous work exists

Tandem duplication was studied in an algorithmic context:

• Main and Lorentz, J. Alg. (1984), Gusfield and Stoye, J. Comp. and Systems Sci. (2004): How to efficiently find tandem duplications in a string.

• Matroud, Hendy, and Tuffley, Nucleic Acids Research (2011): How to efficiently find nested tandem duplications.

• Elemento et al., Molecular Bio. and Evolution (2002), Lajoie et al., J. Comp. Biology (2007), Brejová et al., Phil. Trans. R. Soc. A (2014): How to reconstruct the derivation process of a tandem-duplicated string.


End duplication has full capacity

Theorem

For $S^{\mathrm{end}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{end}}_k)$, $|s| \ge k$,

$$\mathrm{cap}(S^{\mathrm{end}}_k) = \log_2 |\Sigma|.$$

Assumption: The initial string s contains every symbol of Σ at least once.


End duplication has full capacity (Cont.)

Proof.

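≤: We obviously have

$$\mathrm{cap}(S^{\mathrm{end}}_k) = \limsup_{n \to \infty} \frac{\log_2 \left| S^{\mathrm{end}}_k \cap \Sigma^n \right|}{n} \le \limsup_{n \to \infty} \frac{\log_2 |\Sigma^n|}{n} = \log_2 |\Sigma|.$$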


End duplication has full capacity (Cont.)

Proof.

≥: We claim that starting with any string $s \in \Sigma^{\ge k}$, with each symbol appearing at least once, and any $w = w_1 w_2 \dots w_k \in \Sigma^k$, we can derive a string y with w as a suffix.

Step I: Duplicate prefix. Assume s = uv, |u| = k, then

$$s = uv \Rightarrow uvu = s'.$$

Observation: Every symbol of Σ appears at the beginning and at the end of some k-substring of s′.

Step II: Force $w_1$ at the end.

[Figure: a k-substring ending with $w_1$ is end-duplicated, so the string now ends with $w_1$.]


End duplication has full capacity (Cont.)

Proof.

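Step III: Force $w_1 w_2$ at the end.

[Figure: a k-substring beginning with $w_2$ is end-duplicated right after the final $w_1$, so the string now contains $w_1 w_2$.]

and then

[Figure: a k-substring ending with $w_1 w_2$ is end-duplicated, so the string now ends with $w_1 w_2$.]

Repeat Step III inductively to get $w_1 w_2 \dots w_k$ as a suffix.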


End duplication has full capacity (Cont.)

Proof.

Step IV: Repeat previous steps to get every k-word from $\Sigma^k$ as a substring.

Thus, after at most $2k |\Sigma|^k$ duplications we get a string s′′ containing all possible k-substrings, $|s''| \le 2k^2 |\Sigma|^k$.

For any $n = |s''| + tk$ we can now create $|\Sigma|^{tk}$ distinct strings. Hence,

$$\mathrm{cap}(S^{\mathrm{end}}_k) = \limsup_{n \to \infty} \frac{\log_2 \left| S^{\mathrm{end}}_k \cap \Sigma^n \right|}{n} \ge \limsup_{t \to \infty} \frac{\log_2 (|\Sigma|^{tk})}{|s''| + tk} \ge \log_2 |\Sigma|.$$

Corollary

$S^{\mathrm{end}}_k$ systems are fully expressive.


Tandem duplication behaves differently

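But first… Main tool: the $\phi_k$-transform domain. We assume WLOG that $\Sigma = \mathbb{Z}_q$.

Definition

We define the transform $\phi_k : \mathbb{Z}_q^{\ge k} \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$ by

$$\phi_k(x) = \left( \mathrm{Pref}_k(x),\ \mathrm{Suff}_{|x|-k}(x) - \mathrm{Pref}_{|x|-k}(x) \right),$$

as well as $\zeta_{i,k} : \mathbb{Z}_q^k \times \mathbb{Z}_q^* \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$,

$$\zeta_{i,k}(x, y) = \begin{cases} (x, u 0^k w) & \text{if } y = uw,\ |u| = i, \\ (x, y) & \text{otherwise,} \end{cases}$$

where $\mathrm{Pref}_i(x)$ and $\mathrm{Suff}_i(x)$ are, respectively, the i-prefix and i-suffix of x.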


Main tool - φk-transform domain

Lemma

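The following diagram commutes:

$$\begin{array}{ccc} \mathbb{Z}_q^{\ge k} & \xrightarrow{\ T^{\mathrm{tan}}_{i,k}\ } & \mathbb{Z}_q^{\ge k} \\ \downarrow \phi_k & & \downarrow \phi_k \\ \mathbb{Z}_q^k \times \mathbb{Z}_q^* & \xrightarrow{\ \zeta_{i,k}\ } & \mathbb{Z}_q^k \times \mathbb{Z}_q^* \end{array}$$

i.e., for every string $x \in \mathbb{Z}_q^{\ge k}$,

$$\phi_k(T^{\mathrm{tan}}_{i,k}(x)) = \zeta_{i,k}(\phi_k(x)).$$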


Main tool - φk-transform domain

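Example: Assume $\Sigma = \mathbb{Z}_4$. Starting with 02123 and letting i = 1 and k = 2 leads to

$$\begin{array}{ccc} 02123 & \xrightarrow{\ T^{\mathrm{tan}}_{1,2}\ } & 021\underline{21}23 \\ \downarrow \phi_2 & & \downarrow \phi_2 \\ (02, 102) & \xrightarrow{\ \zeta_{1,2}\ } & (02, 1\underline{00}02) \end{array}$$

where the inserted elements are underlined.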
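The example can be replayed numerically. The following is a minimal sketch in which strings over $\mathbb{Z}_q$ are represented as tuples of integers; the function names `phi`, `zeta`, and `tandem` are hypothetical and chosen only for this illustration.

```python
def phi(x, k, q):
    """phi_k(x) = (Pref_k(x), Suff_{|x|-k}(x) - Pref_{|x|-k}(x)), symbol-wise mod q."""
    n = len(x)
    diff = tuple((a - b) % q for a, b in zip(x[k:], x[:n - k]))
    return x[:k], diff


def zeta(pair, i, k):
    """zeta_{i,k}(x, y): insert 0^k into y after its first i symbols, if possible."""
    x, y = pair
    if i < 0 or i > len(y):
        return pair
    return x, y[:i] + (0,) * k + y[i:]


def tandem(x, i, k):
    """T^tan_{i,k}(x): duplicate the length-k block starting at offset i."""
    if i < 0 or i + k > len(x):
        return x
    return x[:i] + x[i:i + k] * 2 + x[i + k:]


x = (0, 2, 1, 2, 3)
k, i, q = 2, 1, 4

left = phi(tandem(x, i, k), k, q)    # duplicate first, then transform
right = zeta(phi(x, k, q), i, k)     # transform first, then insert zeros
print(phi(x, k, q), left, right)
```

Both routes around the diagram give (02, 10002), matching the slide.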


Tandem duplication behaves differently

Theorem

For $S^{\mathrm{tan}}_k = S(\Sigma, s, \mathcal{T}^{\mathrm{tan}}_k)$, $|s| \ge k$, $\mathrm{cap}(S^{\mathrm{tan}}_k) = 0$.

Proof.

In the $\phi_k$-transform domain, $\phi_k(s) = (x, y)$, and tandem duplication becomes an insertion of $0^k$ in the y-part. Thus, a tandem-duplication operation is equivalent to throwing a ball (a block of k zeros) into a bin. There are at most $|y| + 1 = |s| - k + 1$ bins. Thus, after t tandem-duplication operations, there are at most

$$\binom{|s| - k + t}{t} \le (|s| - k + t)^{|s| - k}$$

outcomes. Thus,

$$\mathrm{cap}(S^{\mathrm{tan}}_k) \le \limsup_{t \to \infty} \frac{\log_2 \left( (|s| - k + t)^{|s| - k} \right)}{|s| + tk} = 0.$$
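The polynomial growth in the proof can be observed by brute force. The sketch below enumerates all strings reachable after t tandem duplications of fixed length k and compares the count with the binomial bound; the seed strings are arbitrary choices made for this illustration.

```python
from math import comb, log2


def tandem_successors(x: str, k: int) -> set:
    """All strings obtained from x by one tandem duplication of length k."""
    return {x[:i] + x[i:i + k] * 2 + x[i + k:] for i in range(len(x) - k + 1)}


def reachable_counts(seed: str, k: int, steps: int) -> list:
    """counts[t] = number of distinct strings after exactly t duplications.

    Every duplication adds exactly k symbols, so these are precisely the
    strings of length |seed| + t*k in the system."""
    level = {seed}
    counts = [1]
    for _ in range(steps):
        level = set().union(*(tandem_successors(x, k) for x in level))
        counts.append(len(level))
    return counts


seed, k = "012", 1   # hypothetical seed, chosen only for the illustration
counts = reachable_counts(seed, k, 8)
for t, c in enumerate(counts):
    bound = comb(len(seed) - k + t, t)
    n = len(seed) + t * k
    print(t, c, bound, round(log2(c) / n, 3))
```

For this seed the count meets the bound exactly, and the normalized rate log2(count)/n shrinks as t grows, which is the behavior the theorem describes.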


Tandem duplication behaves differently

Corollary

$S^{\mathrm{tan}}_k$ systems are never fully expressive.

Proof.

If $\phi_k(s) = (x, y)$, then all possible mutations are limited (in the $\phi_k$-transform domain) to $(x, y')$ with y′ being the same as y except for extra zeros. Thus, $\phi_k^{-1}(x, y1)$ can never be obtained from s.


Were we too strict?

Definition (Tandem Duplication)

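$$T^{\mathrm{tan}}_{i,k}(x) = \begin{cases} uvvw & \text{if } x = uvw,\ |u| = i,\ |v| = k, \\ x & \text{otherwise.} \end{cases}$$

$$\mathcal{T}^{\mathrm{tan}}_{\ge k} = \left\{ T^{\mathrm{tan}}_{i,k'} \;\middle|\; i \ge 0,\ k' \ge k \right\}.$$

The lower-bounded tandem-duplication system is defined as $S^{\mathrm{tan}}_{\ge k} = S(\Sigma, s, \mathcal{T}^{\mathrm{tan}}_{\ge k})$.

[Figure: u v w → u v v w]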


Yes, we were! Here’s full expressiveness:

Theorem

$S^{\mathrm{tan}}_{\ge k}$ is fully expressive.

Proof.

Employ a similar procedure to generate each substring as in the proof for $S^{\mathrm{end}}_k$, only each time copy a suffix of the string (from the chosen starting point to the end).


What about full capacity?

Theorem

For any finite alphabet Σ, and $s \in \Sigma^*$, we have

$$\mathrm{cap}(S^{\mathrm{tan}}_{\ge 1}) \ge \log_2(r + 1),$$

where r is the largest (real) root of the polynomial

$$f(x) = x^{|\Sigma|} - \sum_{i=0}^{|\Sigma|-2} x^i.$$

Proof Strategy: Find a set $S \subseteq S^{\mathrm{tan}}_{\ge 1}$ for which we can calculate the capacity. But how?
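To get a feel for the bound, the root r can be located numerically. The sketch below uses plain bisection on the interval [1, 2]; this interval is an assumption of the sketch, justified by f(1) ≤ 0 < f(2) for alphabets of size at least 2 and by f having a single positive root (one sign change in its coefficients).

```python
from math import log2


def f(x: float, q: int) -> float:
    """f(x) = x^q - sum_{i=0}^{q-2} x^i for an alphabet of size q."""
    return x**q - sum(x**i for i in range(q - 1))


def largest_root(q: int, tol: float = 1e-12) -> float:
    """Bisection for the unique positive root of f, which lies in [1, 2]."""
    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid, q) <= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def capacity_lower_bound(q: int) -> float:
    """log2(r + 1), the lower bound on the capacity stated in the theorem."""
    return log2(largest_root(q) + 1)


for q in (2, 3, 4):
    print(q, largest_root(q), capacity_lower_bound(q), log2(q))
```

For a binary alphabet the root is r = 1, so the bound equals 1 = log2 |Σ|; for larger alphabets the bound stays strictly below log2 |Σ|.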


Regular languages to the rescue

Definition (Recipe for a regular language)

• A finite alphabet Σ

• A finite directed labeled graph G = (V, E, L), with E ⊆ V × V in the multiset sense, and L : E → Σ.

• A starting state s ∈ V and a set of accepting states A ⊆ V.

• If $e_1 e_2 \dots e_n$ is a directed path in G, it generates the word $L(e_1) L(e_2) \dots L(e_n)$.

• The language represented by G, denoted S(G), is defined as the set of all words generated by directed paths starting at s and ending in A.


A simple example for a regular language

Example: Consider the following directed labeled graph G:

[Figure: directed graph with edges labeled 0, 1, and 0.]

S(G) is the set of all binary strings where a 1 is followed by a 0.


Graphs have properties

Definition

Let G = (V, E, L) be a graph generating a regular language.

• G is irreducible if for every $v_1, v_2 \in V$, there is a directed path $v_1 \rightsquigarrow v_2$.

• G is primitive if it is irreducible and the gcd of all cycle lengths is 1.

• G is lossless if for every $v_1, v_2 \in V$, and every word $w \in \Sigma^*$, there is at most one path $v_1 \rightsquigarrow v_2$ that generates w.


Counting paths is easy

Definition

For G = (V, E, L) define the adjacency matrix $A_G = (a_{u,v})$ as the |V| × |V| matrix where $a_{u,v}$ is the number of edges from u to v in G.

Observation

• The number of paths $u \rightsquigarrow v$ of length n is exactly $(A_G^n)_{u,v}$.

• For a lossless graph G with one accepting state, i.e., A = {v}, we have $|S(G) \cap \Sigma^n| = (A_G^n)_{s,v}$.

• Thus (with the above setting),

$$\mathrm{cap}(S(G)) = \limsup_{n \to \infty} \frac{\log_2 \left( (A_G^n)_{s,v} \right)}{n}.$$


Enter Perron and Frobenius

[Photographs: O. Perron and G. Frobenius. Source: Wikipedia]

Theorem (Perron-Frobenius (Partial))

If G is a primitive graph then:

1. $\lambda = \lambda(A_G) \triangleq \max \{ |\mu| : \mu \text{ is an eigenvalue of } A_G \}$, also called the spectral radius of G, is an eigenvalue of $A_G$.

2. There exist y, x > 0, unique (up to scalar multiplication) left and right eigenvectors for λ.

3. If $y \cdot x^T = 1$, then

$$\lim_{n \to \infty} \frac{1}{\lambda^n} A_G^n = x^T \cdot y.$$

Corollary

For a primitive lossless graph G, $\mathrm{cap}(S(G)) = \log_2(\lambda(A_G))$.
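The corollary can be illustrated on the earlier example of binary strings in which a 1 is followed by a 0. Since the graph drawing is not reproduced here, the sketch assumes the usual two-state presentation: state 0 has a self-loop labeled 0 and an edge labeled 1 to state 1, and state 1 has an edge labeled 0 back to state 0. The adjacency matrix below follows from that assumption, with state 0 as both the starting and the accepting state.

```python
from math import log2, sqrt


def mat_mul(a, b):
    """Product of two square integer matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(a, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(a)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, a)
    return result


# Assumed adjacency matrix of the two-state graph described above.
A = [[1, 1],
     [1, 0]]

spectral_radius = (1 + sqrt(5)) / 2   # largest eigenvalue of A (golden ratio)

for n in (10, 100, 1000):
    paths = mat_pow(A, n)[0][0]       # paths of length n from state 0 to state 0
    print(n, log2(paths) / n)
print(log2(spectral_radius))
```

The normalized path counts approach log2 of the golden ratio, roughly 0.694, as n grows.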


Back to $S^{\mathrm{tan}}_{\ge 1}$

Proof.

Main Idea: Find a regular language that “resides” within $S^{\mathrm{tan}}_{\ge 1}$ and use its capacity to lower bound $\mathrm{cap}(S^{\mathrm{tan}}_{\ge 1})$.

Phase I: Denote the alphabet letters as $a_1, a_2, \dots, a_{|\Sigma|}$. As in the proof of full expressiveness, assume we reach a string with $a_{|\Sigma|} \dots a_2 a_1$ as a suffix. From now on, we ignore everything except this suffix.

Phase II: Run in iterations. In iteration i, where $i = |\Sigma|, |\Sigma| - 1, \dots, 3, 2$, use tandem duplication only on strings of the form $a_i a_{i-1} \dots a_1$. In the last iteration, tandem duplicate single letters.

It is easy to verify the resulting strings form the following regular language,

$$S = \left( a_{|\Sigma|}^+ \left( a_{|\Sigma|-1}^+ \left( \cdots \left( a_2^+ (a_1^+)^+ \right)^+ \cdots \right)^+ \right)^+ \right)^+.$$


Proof by sub-language (Cont.)

Proof.

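$$S = \left( a_{|\Sigma|}^+ \left( a_{|\Sigma|-1}^+ \left( \cdots \left( a_2^+ (a_1^+)^+ \right)^+ \cdots \right)^+ \right)^+ \right)^+.$$

[Figure: the labeled graph G generating S, with edge labels $a_{|\Sigma|}, a_{|\Sigma|-1}, \dots, a_2, a_1$.]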


Proof by sub-language (Cont.)

Proof.

The graph is lossless, irreducible, and primitive. Its adjacency matrix is

$$A_G = \begin{pmatrix} 1 & 1 & & & & \\ & 1 & 1 & & & \\ & & 1 & 1 & & \\ & & & \ddots & \ddots & \\ & & & & 1 & 1 \\ 1 & 1 & 1 & \cdots & 1 & 1 \end{pmatrix}.$$

Thus, the number of paths of length n from the starting vertex to the accepting vertex grows exponentially as $\lambda^n$, where λ is the spectral radius of the graph, i.e., the largest root of

$$\chi_{A_G}(\lambda) = \det(\lambda I - A_G) = (\lambda - 1)^{|\Sigma|} - \sum_{i=0}^{|\Sigma|-2} (\lambda - 1)^i.$$

Set x = λ − 1 and we obtain the result.
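The link between the matrix and the polynomial can be checked numerically. The sketch below builds the matrix in the form read off the slide (ones on the diagonal and superdiagonal, and a last row of all ones; this reading is an assumption of the sketch), estimates its spectral radius by power iteration, and verifies that λ − 1 is a root of f.

```python
def sublanguage_adjacency(q: int) -> list:
    """q x q matrix: ones on the diagonal and superdiagonal, last row all ones."""
    a = [[0] * q for _ in range(q)]
    for i in range(q - 1):
        a[i][i] = 1
        a[i][i + 1] = 1
    a[q - 1] = [1] * q
    return a


def spectral_radius(a: list, iterations: int = 500) -> float:
    """Power iteration; converges for a primitive nonnegative matrix."""
    n = len(a)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)                 # current estimate of the top eigenvalue
        v = [x / lam for x in w]     # renormalize so the largest entry is 1
    return lam


def f(x: float, q: int) -> float:
    """f(x) = x^q - sum_{i=0}^{q-2} x^i, the polynomial from the theorem."""
    return x**q - sum(x**i for i in range(q - 1))


for q in (2, 3, 4, 5):
    lam = spectral_radius(sublanguage_adjacency(q))
    print(q, lam, f(lam - 1, q))
```

The residual f(λ − 1) is zero up to floating-point error for each alphabet size tried.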


What do we have so far?

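| Type | System | Zero capacity | Partial capacity | Full capacity | Full expressiveness |
|---|---|---|---|---|---|
| End | $S^{\mathrm{end}}_k$ | − | − | ✓ | ✓ |
| End | $S^{\mathrm{end}}_{\ge k}$ | − | − | ✓ | ✓ |
| Tandem | $S^{\mathrm{tan}}_k$ | ✓ | − | − | − |
| Tandem | $S^{\mathrm{tan}}_{\ge k}$ | − | ? | ? | ✓ |

Open Question

Find $\mathrm{cap}(S^{\mathrm{tan}}_{\ge k})$ or improve the bounds on it.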


Full capacity ⇔ full expressiveness?

Theorem

Let S be a string system over the alphabet Σ. If S has full capacity then S has full expressiveness.

Proof.

Assume to the contrary S never contains $w \in \Sigma^k$ as a substring. Partition every word x ∈ S into blocks of length k (and perhaps a remainder block of length at most k − 1). Each block has at most $|\Sigma|^k - 1$ choices, since w is forbidden. Thus,

$$|S \cap \Sigma^n| \le \left( |\Sigma|^k - 1 \right)^{\lfloor n/k \rfloor} \cdot |\Sigma|^{k-1}.$$

Then

$$\mathrm{cap}(S) \le \frac{\log_2 \left( |\Sigma|^k - 1 \right)}{k} < \log_2 |\Sigma|.$$


What about the other direction?

Example: Consider the following string system,

$$S = \{ vv \mid v \in \Sigma^* \}.$$

It is obvious that S is fully expressive, but

$$\mathrm{cap}(S) = \frac{1}{2} \log_2 |\Sigma|,$$

that is, $\frac{1}{2}$ for a binary alphabet.

Open Question

This example is not a string-duplication system. What is the connection between full capacity and full expressiveness for string-duplication systems?


A bit more on the big picture…

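| Type | System | Zero capacity | Partial capacity | Full capacity | Full expressiveness |
|---|---|---|---|---|---|
| End | $S^{\mathrm{end}}_k$ | − | − | ✓ | ✓ |
| End | $S^{\mathrm{end}}_{\ge k}$ | − | − | ✓ | ✓ |
| Tandem | $S^{\mathrm{tan}}_k$ | ✓ | − | − | − |
| Tandem | $S^{\mathrm{tan}}_{\ge k}$ | − | ? | ? | ✓ |
| Palindromic | $S^{\mathrm{pal}}_k$ | − | ? | ? | ✓ |
| Interspersed | $S^{\mathrm{int}}_{k,k'}$ | ✓ | ✓ | ? | ✓ |

Open Question

Complete the missing pieces in this table.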


Let’s add probability to the mix

Why?

• Real biological processes are not always deterministic.

• Just like Shannon vs. Hamming: it is interesting!

Case study:

• Binary alphabet, Σ = {0, 1}, duplication length k = 1.

• The position to duplicate is chosen independently and uniformly.

• Two options:

  • $S^{\mathrm{tan}}_1$: Tandem duplication: bit b becomes $bb$.

  • $\bar{S}^{\mathrm{tan}}_1$: Complement tandem duplication: bit b becomes $b\bar{b}$.


Is this a Pólya urn model?

An urn contains B black balls and W white balls. At each step, a ball is extracted uniformly and independently from the urn. The ball is returned to the urn, together with another ball of the same color. The process repeats.

Crucial difference:

There is no string structure in a Pólya urn model.


How would we define capacity?

Let S(i) denote the random variable whose value is the string after i mutations, and S(0) = s the seed string.

Definition

The probabilistic capacity of the process S is defined as

$$\mathrm{cap}_{\mathrm{Prob}}(S) = \limsup_{n \to \infty} \frac{1}{n} H(S(n)),$$

where H(S(n)) is the entropy of S(n), i.e.,

$$H(S(n)) = -\sum_{w \in \Sigma^*} \Pr(S(n) = w) \log_2 \Pr(S(n) = w).$$

The combinatorial capacity will be denoted by $\mathrm{cap}_{\mathrm{Comb}}$.


Not everything is uniformly distributed

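Assume $\bar{S}^{\mathrm{tan}}_1$ with S(0) = 0. In the tree of outcomes below, each string is followed in parentheses by the label recording the order of the mutations that produced it:

n = 0: 0

n = 1: 01 (1)

n = 2: 011 (21), 010 (12)

n = 3: from 011: 0111 (321), 0101 (231), 0110 (213); from 010: 0110 (312), 0100 (132), 0101 (123)

Thus, $\Pr(S(3) = 0110) = \frac{1}{3}$ but $\Pr(S(3) = 0111) = \frac{1}{6}$.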
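The two probabilities can be reproduced by propagating the exact distribution of the complement tandem-duplication process. The sketch below is an illustration with hypothetical function names; it uses exact rational arithmetic so the values 1/3 and 1/6 appear without rounding.

```python
from collections import defaultdict
from fractions import Fraction
from math import log2


def complement_tandem_step(dist: dict) -> dict:
    """One mutation: a uniformly chosen bit b is replaced by b followed by its complement."""
    new = defaultdict(Fraction)
    for s, p in dist.items():
        for i, b in enumerate(s):
            comp = "1" if b == "0" else "0"
            mutated = s[:i + 1] + comp + s[i + 1:]
            new[mutated] += p / len(s)
    return dict(new)


def distribution(seed: str, n: int) -> dict:
    """Exact distribution of S(n) when S(0) = seed."""
    dist = {seed: Fraction(1)}
    for _ in range(n):
        dist = complement_tandem_step(dist)
    return dist


def entropy(dist: dict) -> float:
    """Shannon entropy in bits."""
    return -sum(float(p) * log2(float(p)) for p in dist.values() if p > 0)


d3 = distribution("0", 3)
for string, prob in sorted(d3.items()):
    print(string, prob)
print(entropy(d3) / 3)
```

The same function also gives H(S(n))/n for small n, which is the quantity whose limit defines the probabilistic capacity.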


One simple connection exists

Lemma

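For $S \in \left\{ S^{\mathrm{tan}}_1, \bar{S}^{\mathrm{tan}}_1 \right\}$, $\mathrm{cap}_{\mathrm{Prob}}(S) \le \mathrm{cap}_{\mathrm{Comb}}(S)$.

Proof.

H(S(n)) is maximized when S(n) is uniformly distributed,

$$H(S(n)) \le \log_2 \left| S \cap \Sigma^{|S(0)| + n} \right|.$$

Thus,

$$\mathrm{cap}_{\mathrm{Prob}}(S) = \limsup_{n \to \infty} \frac{1}{n} H(S(n)) \le \limsup_{n \to \infty} \frac{1}{n} \log_2 \left| S \cap \Sigma^{|S(0)| + n} \right| = \mathrm{cap}_{\mathrm{Comb}}(S).$$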


So for tandem duplication…

Corollary

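For any S(0) we have

$$\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1) = 0.$$

Proof.

We obviously have $\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1) \ge 0$. Additionally,

$$\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1) \le \mathrm{cap}_{\mathrm{Comb}}(S^{\mathrm{tan}}_1) = 0,$$

which we already proved.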


Complement-tandem duplication is harder

Assume S(0) = 0 for simplicity. Let us record the history of mutations in a string, whose ith position equals j if the jth mutation caused the ith symbol.

Example: 0 → 01 → 010 → 0110, with histories ε → 1 → 12 → 312.

Observation

1 History is a permutation.

2 Each permutation is equally likely.


Here it is again

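Assume $\bar{S}^{\mathrm{tan}}_1$ with S(0) = 0. Each string is followed in parentheses by its mutation history:

n = 0: 0

n = 1: 01 (1)

n = 2: 011 (21), 010 (12)

n = 3: from 011: 0111 (321), 0101 (231), 0110 (213); from 010: 0110 (312), 0100 (132), 0101 (123)

Some histories result in the same mutated string.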


It’s all in the signature

Definition

The signature of a permutation $\pi \in S_n$ is a binary string $w = w_1 w_2 \dots w_{n-1}$, where

$$w_i = \begin{cases} 0 & \pi(i) > \pi(i+1), \\ 1 & \pi(i) < \pi(i+1). \end{cases}$$

Theorem

Consider $\bar{S}^{\mathrm{tan}}_1$ with S(0) = 0. Then $\Pr(S(n) = 01w)$ is the same as the probability of getting the signature w when choosing a permutation from $S_n$ (uniformly).
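For small n the theorem can be checked exhaustively. The sketch below computes the exact distribution of S(n) for the complement tandem-duplication process and the signature distribution of a uniform permutation, and compares them; the function names are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import permutations


def signature(perm) -> str:
    """w_i = 1 if perm(i) < perm(i+1), and 0 if perm(i) > perm(i+1)."""
    return "".join("1" if a < b else "0" for a, b in zip(perm, perm[1:]))


def signature_distribution(n: int) -> dict:
    """Distribution of the signature of a uniformly chosen permutation of n items."""
    counts = Counter(signature(p) for p in permutations(range(1, n + 1)))
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}


def suffix_distribution(n: int) -> dict:
    """Distribution of w, where S(n) = 01w, for the complement process with S(0) = 0."""
    dist = {"0": Fraction(1)}
    for _ in range(n):
        new = defaultdict(Fraction)
        for s, p in dist.items():
            for i, b in enumerate(s):
                comp = "1" if b == "0" else "0"
                new[s[:i + 1] + comp + s[i + 1:]] += p / len(s)
        dist = dict(new)
    # After at least one mutation every string starts with 01; strip that prefix.
    return {s[2:]: p for s, p in dist.items()}


for n in range(1, 6):
    print(n, suffix_distribution(n) == signature_distribution(n))
```

This only confirms the statement for the values of n tried; it is not a substitute for the proof that follows.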


It’s all in the signature – Proof

Proof.

Assuming $w \in \{0, 1\}^{n-1}$, some notation first:

1. $\Pi_{01w}$: the set of history permutations that lead to a mutated string 01w.

2. $\Psi_w$: the set of permutations from $S_n$ with signature w.

3. For any string $v \in \{0, 1\}^{\ell}$, the set of positions where a 0 is preceded by a 1 (including possible edges):

$$T_v = \{ i \in [\ell + 1] : (v_{i-1} = 1 \text{ or } i = 1) \text{ and } (v_i = 0 \text{ or } i = \ell + 1) \}.$$

Example: for v = 0011010 we have $T_v = \{1, 5, 7\}$.


It’s all in the signature – Proof (Cont.)

Proof.

Strategy: Prove $|\Pi_{01w}| = |\Psi_w|$ by showing both expressions have the same recursion with the same starting conditions.

Starting conditions: Trivially $|\Pi_{01\varepsilon}| = |\Psi_{\varepsilon}| = 1$.

Recursion for $\Psi_w$: Given $w \in \{0, 1\}^{n-1}$, we can recursively construct a permutation $\pi \in S_n$ with signature w by picking $\pi^{-1}(n)$, which can only be some $i \in T_w$. We then recursively construct two permutations, with signatures $w_{1 \dots i-2}$ and $w_{i+1 \dots n-1}$. Thus,

$$|\Psi_w| = \sum_{i \in T_w} \binom{n-1}{i-1} \left| \Psi_{w_{1 \dots i-2}} \right| \left| \Psi_{w_{i+1 \dots n-1}} \right|.$$
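The recursion for $|\Psi_w|$ can be compared against direct enumeration. The sketch below is an illustration with hypothetical function names; positions are 1-based as on the slides, and a part containing zero or one element contributes a factor of 1.

```python
from functools import lru_cache
from itertools import permutations, product
from math import comb


def peak_positions(v: str) -> list:
    """T_v: positions i in [len(v)+1] with (v_{i-1} = 1 or i = 1) and (v_i = 0 or i = len(v)+1)."""
    ell = len(v)
    return [i for i in range(1, ell + 2)
            if (i == 1 or v[i - 2] == "1") and (i == ell + 1 or v[i - 1] == "0")]


@lru_cache(maxsize=None)
def count_by_recursion(w: str) -> int:
    """|Psi_w| via the recursion: place the largest value at a position i in T_w."""
    if not w:
        return 1
    n = len(w) + 1
    total = 0
    for i in peak_positions(w):
        left = w[:max(i - 2, 0)]    # signature of the i-1 entries before position i
        right = w[i:]               # signature of the n-i entries after position i
        total += comb(n - 1, i - 1) * count_by_recursion(left) * count_by_recursion(right)
    return total


def count_by_enumeration(w: str) -> int:
    """|Psi_w| by listing all permutations of len(w)+1 items."""
    def signature(p):
        return "".join("1" if a < b else "0" for a, b in zip(p, p[1:]))
    return sum(1 for p in permutations(range(len(w) + 1)) if signature(p) == w)


print(peak_positions("0011010"))
for w in ("", "0", "10", "010", "0101"):
    print(repr(w), count_by_recursion(w), count_by_enumeration(w))
```

For the alternating signature 010 both methods give 5, which matches the probability 5/24 of the string 01010 after four mutations.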


It’s all in the signature – Proof (Cont.)

Proof.

Recursion for $\Pi_{01w}$: Given $w \in \{0, 1\}^{n-1}$, consider a history permutation $\pi \in S_n$ resulting in the mutated sequence 01w. Obviously $\pi^{-1}(1)$ is a position of a bit 1 in 01w which is last in a run, i.e., followed by a 0 or last in the string. Thus, pick $\pi^{-1}(1)$, and construct the rest of the permutation recursively using $w_{1 \dots i-2}$ and $w_{i+1 \dots n-1}$. Thus,

$$|\Pi_{01w}| = \sum_{i \in T_w} \binom{n-1}{i-1} \left| \Pi_{01 w_{1 \dots i-2}} \right| \left| \Pi_{10 w_{i+1 \dots n-1}} \right|.$$


Last time I’m showing this slide

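Assume $\bar{S}^{\mathrm{tan}}_1$ with S(0) = 0. Each string is followed in parentheses by its mutation history:

n = 0: 0

n = 1: 01 (1)

n = 2: 011 (21), 010 (12)

n = 3: from 011: 0111 (321), 0101 (231), 0110 (213); from 010: 0110 (312), 0100 (132), 0101 (123)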

Open Question

Find a nice bijection between Π01w and Ψw.


And now, the capacity

Theorem

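For $\bar{S}^{\mathrm{tan}}_1$ with S(0) = 0,

$$0.7213 \approx \frac{\log_2(e)}{2} \le \mathrm{cap}_{\mathrm{Prob}}(\bar{S}^{\mathrm{tan}}_1) \le H_2\!\left(\frac{1}{3}\right) \approx 0.9183,$$

where $H_2(x) \triangleq -x \log_2(x) - (1 - x) \log_2(1 - x)$ is the binary entropy function.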


Proof of the bounds

Proof.

Consider the real random variables $X_1, X_2, \dots$, chosen i.i.d., uniformly from [0, 1]. Sorting $X_1^n \triangleq X_1, X_2, \dots, X_n$ generates a random permutation (due to symmetry, chosen uniformly from $S_n$).

Define

$$Q_i \triangleq \begin{cases} 1 & X_i < X_{i+1}, \\ 0 & X_i > X_{i+1}, \end{cases}$$

(except for a 0-measure undefined set). So $Q_1^{n-1}$ is a signature of a uniformly chosen random permutation from $S_n$.


Proof of the bounds (Cont.)

Proof.

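We now have

\[
\Pr(S(n) = 01w) = \Pr(Q_1^{n-1} = w),
\]

and

\begin{align*}
\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1)
&= \limsup_{n \to \infty} \frac{1}{n} H(S(n)) = \limsup_{n \to \infty} \frac{1}{n} H(Q_1^{n-1}) \\
&= \limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid Q_1^{i-1}).
\end{align*}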
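
For small n the entropy $H(Q_1^{n-1})$ can be computed exactly by enumerating $S_n$, which gives a concrete feel for the quantity whose normalized limit is the capacity. The sketch below assumes only the definition of $Q_i$ above; the function name `signature_entropy` is hypothetical and introduced for illustration.

```python
import math
from collections import Counter
from itertools import permutations

def signature_entropy(n: int) -> float:
    """Exact entropy (bits) of the signature Q_1^{n-1} of a uniform permutation in S_n."""
    # The relative order of X_1..X_n is a uniform permutation, so it suffices
    # to count how many permutations produce each signature.
    counts = Counter(
        tuple(1 if p[i] < p[i + 1] else 0 for i in range(n - 1))
        for p in permutations(range(n))
    )
    total = math.factorial(n)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

for n in range(2, 9):
    h = signature_entropy(n)
    print(n, round(h, 4), round(h / n, 4))  # n, H(Q_1^{n-1}), H(Q_1^{n-1}) / n
```

Enumeration costs n! steps, so this is only practical for single-digit n; it illustrates the definition and is not a way to compute the limit.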

Proof of the bounds (Cont.)

Proof.

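Lower bound: Since $Q_1^{i-1} \to X_i \to Q_i$ we have

\[
H(Q_i \mid Q_1^{i-1}) \ge H(Q_i \mid X_i).
\]

Furthermore, $\Pr(Q_i = 0 \mid X_i = x) = x$. Thus,

\begin{align*}
\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1)
&= \limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid Q_1^{i-1}) \\
&\ge \limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid X_i) \\
&= H(Q_1 \mid X_1) = \int_0^1 H_2(x)\,dx = \frac{\log_2 e}{2}.
\end{align*}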
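
The last equality is stated without computation. A short worked evaluation, using only the symmetry $H_2(x) = H_2(1-x)$ and integration by parts, could read as follows:

```latex
\begin{align*}
\int_0^1 H_2(x)\,dx
  &= -2 \int_0^1 x \log_2 x \,dx
     && \text{(both terms of $H_2$ contribute equally, by $x \leftrightarrow 1-x$)} \\
  &= -\frac{2}{\ln 2} \int_0^1 x \ln x \,dx
   = -\frac{2}{\ln 2} \left[ \frac{x^2}{2} \ln x - \frac{x^2}{4} \right]_0^1 \\
  &= -\frac{2}{\ln 2} \cdot \left( -\frac{1}{4} \right)
   = \frac{1}{2 \ln 2} = \frac{\log_2 e}{2} \approx 0.7213 .
\end{align*}
```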

Proof of the bounds (Cont.)

Proof.

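Upper bound:

\begin{align*}
\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1)
&= \limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n-1} H(Q_i \mid Q_1^{i-1})
 \le \limsup_{n \to \infty} \frac{1}{n} \sum_{i=2}^{n-1} H(Q_i \mid Q_{i-1}) \\
&= H(Q_2 \mid Q_1) = \frac{1}{2} \bigl( H(Q_2 \mid Q_1 = 0) + H(Q_2 \mid Q_1 = 1) \bigr) \\
&= H_2\!\left(\tfrac{1}{3}\right),
\end{align*}

since

\[
\Pr(Q_2 = 0 \mid Q_1 = 0)
= \frac{\int_0^1 dx_1 \int_0^{x_1} dx_2 \int_0^{x_2} dx_3}{\int_0^1 dx_1 \int_0^{x_1} dx_2}
= \frac{1/6}{1/2} = \frac{1}{3},
\]

and similarly for $\Pr(Q_2 = 1 \mid Q_1 = 1)$.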
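
The conditional probability can also be checked combinatorially, since only the relative order of $(X_1, X_2, X_3)$ matters and all six orders are equally likely. A minimal Python sketch of that count (the helper `q` and the variable names are illustrative, not from the slides):

```python
from fractions import Fraction
from itertools import permutations

def q(a, b):
    """Q_i = 1 if X_i < X_{i+1}, else 0 (ties have probability zero)."""
    return 1 if a < b else 0

perms = list(permutations(range(3)))                     # relative orders of (X_1, X_2, X_3)
q1_zero = [p for p in perms if q(p[0], p[1]) == 0]       # event Q_1 = 0
both_zero = [p for p in q1_zero if q(p[1], p[2]) == 0]   # event Q_1 = 0 and Q_2 = 0

prob = Fraction(len(both_zero), len(q1_zero))            # Pr(Q_2 = 0 | Q_1 = 0)
print(prob)                                              # 1/3
```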

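Probabilistic ≠ Combinatorial

Observation

\[
\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1) \le H_2\!\left(\tfrac{1}{3}\right) < 1 = \mathrm{cap}_{\mathrm{Comb}}(S^{\mathrm{tan}}_1).
\]

Open Questions

1. Find $\mathrm{cap}_{\mathrm{Prob}}(S^{\mathrm{tan}}_1)$.

2. We know nothing for duplication length $\ge 2$.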

Moving on to error correction

An error-correcting code has two main components:

1. An error ball: Its size and shape depend on the kind of errors the channel induces.

2. A packing of error balls: Its density affects communication efficiency. Its structure affects ease of encoding/decoding.

Let us recall the scenario

• Information is stored in the DNA of some bacteria.

• The bacteria mutate over time.

• When the information is read, the DNA has gone through a (perhaps unbounded) number of duplications.

Goal

Protect information against duplication errors!

Case study

We focus on $S^{\mathrm{tan}}_k$: tandem duplication with fixed duplication length $k$.

Some definitions are required

Definition

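If $v \in S^{\mathrm{tan}}_k(\Sigma, u, T^{\mathrm{tan}}_k)$, we denote it as $u \Longrightarrow^*_k v$. We say $u$ is an ancestor of $v$, and $v$ is a descendant of $u$. We define the descendant cone of $u$ as

\[
D^*_k(u) = \{ v \in \Sigma^* : u \Longrightarrow^*_k v \},
\]

and the ancestor cone as

\[
A^*_k(u) = \{ v \in \Sigma^* : v \Longrightarrow^*_k u \}.
\]

[Figure: the ancestor cone $A^*_k(u)$ and the descendant cone $D^*_k(u)$ of $u$, drawn along a time axis.]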

Now we define a code

Definition

An $(n, M; *)_k$ code $C$ is a subset $C \subseteq \Sigma^n$ of size $|C| = M$, such that for each $u, v \in C$, $u \ne v$,

\[
D^*_k(u) \cap D^*_k(v) = \emptyset.
\]

The decoding problem

Given an $(n, M; *)_k$ code $C$, and a (mutated) word $v \in \Sigma^*$, find

\[
\mathrm{Decode}(v) = A^*_k(v) \cap C.
\]

Reminder – The φk-transform

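We assume WLOG that $\Sigma = \mathbb{Z}_q$.

Definition

We define the transform $\phi_k : \mathbb{Z}_q^{\ge k} \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$ by

\[
\phi_k(x) = \bigl( \mathrm{Pref}_k(x),\ \mathrm{Suff}_{|x|-k}(x) - \mathrm{Pref}_{|x|-k}(x) \bigr),
\]

as well as $\zeta_{i,k} : \mathbb{Z}_q^k \times \mathbb{Z}_q^* \to \mathbb{Z}_q^k \times \mathbb{Z}_q^*$,

\[
\zeta_{i,k}(x, y) =
\begin{cases}
(x, u 0^k w) & \text{if } y = uw,\ |u| = i, \\
(x, y) & \text{otherwise},
\end{cases}
\]

where $\mathrm{Pref}_i(x)$ and $\mathrm{Suff}_i(x)$ are, respectively, the $i$-prefix and $i$-suffix of $x$.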

Main tool - φk-transform domain

Lemma

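The following diagram commutes:

\[
\begin{array}{ccc}
\mathbb{Z}_q^{\ge k} & \xrightarrow{\;T^{\mathrm{tan}}_{i,k}\;} & \mathbb{Z}_q^{\ge k} \\
\Big\downarrow{\scriptstyle \phi_k} & & \Big\downarrow{\scriptstyle \phi_k} \\
\mathbb{Z}_q^k \times \mathbb{Z}_q^* & \xrightarrow{\;\zeta_{i,k}\;} & \mathbb{Z}_q^k \times \mathbb{Z}_q^*
\end{array}
\]

i.e., for every string $x \in \mathbb{Z}_q^{\ge k}$,

\[
\phi_k(T^{\mathrm{tan}}_{i,k}(x)) = \zeta_{i,k}(\phi_k(x)).
\]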

Main tool - φk-transform domain

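Example

Assume $\Sigma = \mathbb{Z}_4$. Starting with $02123$ and letting $i = 1$ and $k = 2$ leads to

\[
\begin{array}{ccc}
02123 & \xrightarrow{\;T^{\mathrm{tan}}_{1,2}\;} & 021\underline{21}23 \\
\Big\downarrow{\scriptstyle \phi_2} & & \Big\downarrow{\scriptstyle \phi_2} \\
(02, 102) & \xrightarrow{\;\zeta_{1,2}\;} & (02, 1\underline{00}02)
\end{array}
\]

where the inserted elements are underlined.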
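
The example can be replayed mechanically. The following Python sketch implements $\phi_k$ and $\zeta_{i,k}$ as defined above and checks that the diagram commutes on all short strings. The convention for $T^{\mathrm{tan}}_{i,k}$ (a string $uvw$ with $|u| = i$, $|v| = k$ becomes $uvvw$, and is left unchanged when no such block exists) is an assumption inferred from this example; the function names are hypothetical.

```python
from itertools import product

def tandem_dup(x, i, k):
    """T_{i,k}: x = u v w with |u| = i, |v| = k maps to u v v w (assumed convention)."""
    if i + k > len(x):
        return list(x)              # no length-k block starts at position i
    return list(x[:i + k]) + list(x[i:i + k]) + list(x[i + k:])

def phi(x, k, q):
    """phi_k(x) = (Pref_k(x), Suff_{|x|-k}(x) - Pref_{|x|-k}(x)), arithmetic mod q."""
    assert len(x) >= k
    diff = [(x[j + k] - x[j]) % q for j in range(len(x) - k)]
    return list(x[:k]), diff

def zeta(pair, i, k):
    """zeta_{i,k}: insert 0^k after the first i symbols of y, if y is long enough."""
    x, y = pair
    if i > len(y):
        return list(x), list(y)
    return list(x), list(y[:i]) + [0] * k + list(y[i:])

def commutes_everywhere(q, k, max_len):
    """Check phi_k(T_{i,k}(x)) == zeta_{i,k}(phi_k(x)) for all strings of length <= max_len."""
    for n in range(k, max_len + 1):
        for x in product(range(q), repeat=n):
            for i in range(n + 1):
                if phi(tandem_dup(x, i, k), k, q) != zeta(phi(x, k, q), i, k):
                    return False
    return True

x = [0, 2, 1, 2, 3]                  # the example over Z_4
print(tandem_dup(x, 1, 2))           # [0, 2, 1, 2, 1, 2, 3]
print(phi(x, 2, 4))                  # ([0, 2], [1, 0, 2])
print(zeta(phi(x, 2, 4), 1, 2))      # ([0, 2], [1, 0, 0, 0, 2])
print(commutes_everywhere(q=3, k=2, max_len=5))
```

The exhaustive check is only a finite sanity test on small parameters; the lemma itself covers all strings.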

The ancestors are the key component

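[Figure: the ancestor cone $A^*_k(u)$ and the descendant cone $D^*_k(u)$ of $u$, drawn along a time axis, with the root $R_k(u)$ marked.]

Definition

If $A^*_k(v) = \{v\}$ we say $v$ is irreducible. The set of irreducible words is denoted $\mathrm{Irr}_k$. The roots of $v \in \Sigma^*$ are defined by $R_k(v) = A^*_k(v) \cap \mathrm{Irr}_k$.

Lemma

For tandem duplication of length $k$, and every $v \in \Sigma^*$, $|R_k(v)| = 1$.

Already proved by Leupold et al. (2005). We give a different proof, using $\phi_k$, enabling a code construction.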

Proof of root uniqueness

Proof.

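Denote $\phi_k(v) = (x, y)$, and $y = 0^{m_0} y_1 0^{m_1} y_2 0^{m_2} \dots 0^{m_{t-1}} y_t 0^{m_t}$, where $y_i \ne 0$ for all $i$. Any ancestor $v' \in A^*_k(v)$ must be of the form

\[
\phi_k(v') = (x,\ 0^{m_0 - i_0 k} y_1 0^{m_1 - i_1 k} y_2 0^{m_2 - i_2 k} \dots 0^{m_{t-1} - i_{t-1} k} y_t 0^{m_t - i_t k}),
\]

and it is irreducible if and only if

\[
\phi_k(v') = (x,\ 0^{m_0 \bmod k} y_1 0^{m_1 \bmod k} y_2 0^{m_2 \bmod k} \dots 0^{m_{t-1} \bmod k} y_t 0^{m_t \bmod k}),
\]

giving a unique root.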

Disjoint descendant cones are simple

Corollary

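$D^*_k(u) \cap D^*_k(v) \ne \emptyset$ if and only if $R_k(u) = R_k(v)$.

Proof.

$\Rightarrow$: If $w \in D^*_k(u) \cap D^*_k(v)$ then

\[
R_k(u) \Longrightarrow^*_k u \Longrightarrow^*_k w \quad \text{and} \quad R_k(v) \Longrightarrow^*_k v \Longrightarrow^*_k w,
\]

and since the root of $w$ is unique, $R_k(u) = R_k(v)$.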

Disjoint cones proof (Cont.)

Proof.

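$\Leftarrow$: If $R_k(u) = R_k(v)$ then denote

\[
\phi_k(R_k(u)) = \phi_k(R_k(v)) = (x,\ 0^{m_0} y_1 0^{m_1} y_2 0^{m_2} \dots 0^{m_{t-1}} y_t 0^{m_t}).
\]

Then,

\begin{align*}
\phi_k(u) &= (x,\ 0^{m'_0} y_1 0^{m'_1} y_2 0^{m'_2} \dots 0^{m'_{t-1}} y_t 0^{m'_t}), \\
\phi_k(v) &= (x,\ 0^{m''_0} y_1 0^{m''_1} y_2 0^{m''_2} \dots 0^{m''_{t-1}} y_t 0^{m''_t}).
\end{align*}

Define $w \in \Sigma^*$ such that

\[
\phi_k(w) = (x,\ 0^{\max(m'_0, m''_0)} y_1 0^{\max(m'_1, m''_1)} \dots 0^{\max(m'_{t-1}, m''_{t-1})} y_t 0^{\max(m'_t, m''_t)}).
\]

Since $m'_j \equiv m''_j \equiv m_j \pmod{k}$ for every $j$, each run of zeros in $\phi_k(w)$ exceeds the corresponding runs in $\phi_k(u)$ and $\phi_k(v)$ by a multiple of $k$, which immediately shows $u \Longrightarrow^*_k w$ and $v \Longrightarrow^*_k w$.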

Putting it all together

Theorem

• $v \in \mathrm{Irr}_k$ iff $\phi_k(v) = (x, y)$ and $y$ is $(0, k-1)$-RLL.

• $\mathrm{Irr}_k \cap \Sigma^n$ is an $(n, M; *)_k$-code.

• Decoding $v \in \Sigma^*$ may be done in linear time by:

1. Finding $\phi_k(v) = (x, y)$.
2. Reducing runs of 0's in $y$ modulo $k$ to obtain $y'$.
3. Returning the answer $\phi_k^{-1}(x, y')$.
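
The three decoding steps translate almost line by line into code. The sketch below is one possible Python rendering, assuming $\Sigma = \mathbb{Z}_q$ with symbols given as integers and $|v| \ge k$; the function names (`phi`, `phi_inv`, `reduce_zero_runs`, `decode`) are hypothetical and chosen for illustration.

```python
def phi(v, k, q):
    """Step 1: phi_k(v) = (k-prefix, differences v[j+k] - v[j] mod q)."""
    assert len(v) >= k
    return list(v[:k]), [(v[j + k] - v[j]) % q for j in range(len(v) - k)]

def reduce_zero_runs(y, k):
    """Step 2: shorten every maximal run of zeros in y to its length modulo k."""
    out, run = [], 0
    for s in y:
        if s == 0:
            run += 1
        else:
            out.extend([0] * (run % k))
            out.append(s)
            run = 0
    out.extend([0] * (run % k))     # trailing run of zeros
    return out

def phi_inv(prefix, y, q):
    """Step 3: invert phi_k by rebuilding x[j+k] = x[j] + y[j] mod q."""
    x = list(prefix)
    for j, d in enumerate(y):
        x.append((x[j] + d) % q)
    return x

def decode(v, k, q):
    """Return the unique irreducible ancestor (root) of v; one pass per step."""
    prefix, y = phi(v, k, q)
    return phi_inv(prefix, reduce_zero_runs(y, k), q)

# The mutated word 0212123 over Z_4 (k = 2) decodes back to 02123.
print(decode([0, 2, 1, 2, 1, 2, 3], k=2, q=4))   # [0, 2, 1, 2, 3]
```

Each step scans its input once, which matches the linear-time claim; an irreducible word is returned unchanged.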

Observation

The code may be further enlarged (and made optimal!) by carefully adding shorter RLL sequences.

Other results

• $S^{\mathrm{tan}}_{\le 3}$: forms a regular language, unique root, positive (though not full) capacity, not fully expressive. [3]

• Unique root exists in several other cases, enabling code construction and decoding. [3]

Theorem

Let $\Sigma \ne \emptyset$ be an alphabet, and $U \subseteq \mathbb{N}$, $U \ne \emptyset$, a set of tandem-duplication lengths. Denote $k = \min(U)$. Then $(\Sigma, U)$ is a unique-root pair if and only if it matches one of the following cases:

$|\Sigma| = 1$: $U \subseteq k\mathbb{N}$
$|\Sigma| = 2$: $U = \{k\}$, or $U \supseteq \{1, 2\}$
$|\Sigma| \ge 3$: $U = \{k\}$, or $U = \{1, 2\}$, or $U = \{1, 2, 3\}$

[3] Jain et al., IEEE Trans. on Inform. Th. 2017.

Other results

• What is the longest duplication distance to the root (in unbounded tandem duplication)? Apparently for length-$n$ sequences it is $\Theta(n)$ in the worst (and common!) case. [4]

• In the probabilistic models we also know the capacity of end duplication, as well as a mix of duplication and complement duplication, but only for duplication length $k = 1$. [5]

• Tandem duplication with point-mutation (substitution) has more capacity and expressiveness, but requires more care when constructing error-correcting codes. [6]

[4] Alon et al., ISIT 2016.
[5] Elishco et al., ISIT 2016.
[6] Jain et al., ISIT 2017.

Many open questions remain!

Open Questions

• Study error-correcting codes for duplication models other than tandem duplication.

• Find error-correcting codes for a probabilistic channel, correcting typical errors.

• Study a mix of duplication and other mutations (substitutions, insertions/deletions).

• Study error models which are context sensitive.

• For the biologists: Find out the channel parameters in the real world.

Thank You
