Portable and Random-Access DNA-Based Storage Systemstce.webee.eedev.technion.ac.il/wp-content/uploads/sites/8/2017/07/... · MotivationImplementationsEnd-to-End DesignTHANK YOU! Portable

Motivation Implementations End-to-End Design THANK YOU!

Portable and Random-Access DNA-Based Storage Systems

Olgica Milenkovic

A joint work with R. Gabrys, E. Garcia Ruiz, H. M. Kiah, J. Ma, G. J. Puleo, M.

Riedel, D. Soloveichik, H. M. Tabatabaei Yazdi, Y. Yuan, E. Yaakobi and H. Zhao

Henry Taub TCE Conference

June 2017


Motivation


DNA as Storage Media

DNA stores the “information of life” - the most massive datasource on Earth.

▸ DNA is extremely durable: Can still “read” DNA from mammoths,Neanderthals, and 700,000 old horses!

▸ DNA information content of Human cell: 6.4 GB. Mass of a cell: ∼ 3picograms. No. of cells: 15 − 40 × 1012.

▸ Can one store information in DNA?

▸ This question has been raised before: “There is plenty of room at thebottom,” R. Feynman.























We can write...

We can “write” in DNA using what is called the process of DNA Synthesis.

Biochemistry of synthesis: Stitching together bases from the set {A,T,G,C}through deprotection & coupling cycles.

Coupling

3

Capping

Oxidation

De-blocking

6

5’

3’

Final deprotection and cleavage from the support

Initial deprotection

nucleoside2-cyanoethyl phosphoramidite

Blocking unreacted 5’-OH groups

Base 2O

O

O

DMTr

O NP

CH3

CH2CH2NC

CH3

CH3

CH3

5’

3’1

Base 1O

O

O

DMTr

AAAA

5’

3’2

Base 1O

OH

O

AAAA

5’

3’4

Base 1O

OP

O

A

Base 2O

O

O

DMTr

OCH2CH2NC

AAAAA

53’

5’ Base 1O

O

O

A

O

CH3

AAAA

O

Base 1O

O

P

O

A

Base 2O

O

O

DMTr

OCH2CH2NC

AAAA

Final oligonucleotide

5’

3’7

O

Base 1O

O

P

OH

Base 2O

O

O

OCH2CH2NC

OP

O

OCH2CH2NC

O

OP

Base nO

OH

O

OCH2CH2NC


We can write...

We can “write” in DNA using what is called the process of DNA Synthesis.

Biochemistry of synthesis: Stitching together bases from the set {A,T,G,C}through deprotection & coupling cycles.

Coupling

3

Capping

Oxidation

De-blocking

6

5’

3’

Final deprotection and cleavage from the support

Initial deprotection

nucleoside2-cyanoethyl phosphoramidite

Blocking unreacted 5’-OH groups

Base 2O

O

O

DMTr

O NP

CH3

CH2CH2NC

CH3

CH3

CH3

5’

3’1

Base 1O

O

O

DMTr

AAAA

5’

3’2

Base 1O

OH

O

AAAA

5’

3’4

Base 1O

OP

O

A

Base 2O

O

O

DMTr

OCH2CH2NC

AAAAA

53’

5’ Base 1O

O

O

A

O

CH3

AAAA

O

Base 1O

O

P

O

A

Base 2O

O

O

DMTr

OCH2CH2NC

AAAA

Final oligonucleotide

5’

3’7

O

Base 1O

O

P

OH

Base 2O

O

O

OCH2CH2NC

OP

O

OCH2CH2NC

O

OP

Base nO

OH

O

OCH2CH2NC


We Can Write...

Commercial synthesis: Agilent, Gen9, IDT, Twist Bioscience (recent feature on“Rewriting DNA Synthesis on Silicon”).

▸ DNA microarray based short string - oligo - synthesis (left): Cost effective,large scale. Moderate error rates.

▸ Long strand - gBlock - synthesis (right): Assembles short blocks.Chemical error-correction.

▸ Types of synthesis errors: Deletions, insertions, substitutions.


We Can Write...






We Can Write...






We Can Read...

Illumina (MiSeq): Best overall performance of modern sequencing technologiesin terms of yield and accuracy. Relatively small error rates (substitutions andrare deletions). Short read length.

Steps: Cloning /// Shearing /// Reading of unordered pool /// Computeraided alignment of overlapping fragments /// Consensus


We Can Read...

Illumina (MiSeq): Best overall performance of modern sequencing technologiesin terms of yield and accuracy. Relatively small error rates (substitutions andrare deletions). Short read length.

Steps: Cloning /// Shearing /// Reading of unordered pool /// Computeraided alignment of overlapping fragments /// Consensus


We Can Read...

Oxford Nanopore - MinIon: Longer read lengths, portable architecture.Context-dependent deletion, insertion and substitution errors.

Key properties: Biological pore(s) and motor, base calling using deep learningtechniques.


We Can Read...

Oxford Nanopore - MinIon: Longer read lengths, portable architecture.Context-dependent deletion, insertion and substitution errors.

Key properties: Biological pore(s) and motor, base calling using deep learningtechniques.


We Can Amplify and Enable Random Access...

Polymerase Chain Reaction (PCR): Cheap, fast, “exponential” informationreplication.

Primers - Key enablers of PCR: Short DNA strands that initiate replication at“strand-matching” locations (red blocks).










Implementations


“Double Helix Serves Double Duty”, NY Times, Jan 2013

▸ Church et al. (Science, 2012) and later Goldman et al. (Nature, 2013)stored 739 KB of data in synthetic DNA, mailed it and recreated theoriginal digital files using Illumina readers.

▸ Digital archival storage systems that will safely store the equivalent of onemillion CDs in a gram of DNA for 10,000 years.

100

D - DNA fragments

1 100 2 12

1 1

25

Encoded data

Reverse complemented encoded data

File identification

Intra-file location information

Parity-check

Reversed or non-reversed flag

A - Binary/text

B - Base-3-encoded

C - DNA-encoded

B e n a m k h o d a v a n d … …

…10001001111001010110110…

20112 20200 02110 10002 02212 01112 02110 10221 02212 11021 02212 10101 02212 10002 … …

TCACT ATATA TGTGA CGATA TAGTA TGTGC GCACG TCTAC GCTGC ACGCA TAGTA CGTCA TAGTA CGATA … …

Alternate fragments have file information reverse complemented


“Double Helix Serves Double Duty”, NY Times, Jan 2013

▸ Church et al. (Science, 2012) and later Goldman et al. (Nature, 2013)stored 739 KB of data in synthetic DNA, mailed it and recreated theoriginal digital files using Illumina readers.

▸ Digital archival storage systems that will safely store the equivalent of onemillion CDs in a gram of DNA for 10,000 years.

100

D - DNA fragments

1 100 2 12

1 1

25

Encoded data

Reverse complemented encoded data

File identification

Intra-file location information

Parity-check

Reversed or non-reversed flag

A - Binary/text

B - Base-3-encoded

C - DNA-encoded

B e n a m k h o d a v a n d … …

…10001001111001010110110…

20112 20200 02110 10002 02212 01112 02110 10221 02212 11021 02212 10101 02212 10002 … …

TCACT ATATA TGTGA CGATA TAGTA TGTGC GCACG TCTAC GCTGC ACGCA TAGTA CGTCA TAGTA CGATA … …

Alternate fragments have file information reverse complemented


“Data Storage on DNA Can Keep it Safe for Centuries,”NY Times, Dec 2015

Our Work: A fully operational random access and rewritable DNA-basedmemory with Sanger sequencing.

▸ Yazdi et.al., 2015 - First random access, rewritable DNA-based storagesystem. Encoded Wikipedia entries for six US universities.

▸ Large scale testing of the methods: Microsoft Research/Twist Bioscience,2016. Coverage by Spectrum, Nature, New Scientist etc.


“Data Storage on DNA Can Keep it Safe for Centuries,”NY Times, Dec 2015

Our Work: A fully operational random access and rewritable DNA-basedmemory with Sanger sequencing.

▸ Yazdi et.al., 2015 - First random access, rewritable DNA-based storagesystem. Encoded Wikipedia entries for six US universities.

▸ Large scale testing of the methods: Microsoft Research/Twist Bioscience,2016. Coverage by Spectrum, Nature, New Scientist etc.


Practical Obstacles

▸ Cost of synthesis 5 − 6 orders of magnitude higher than for classicalrecorders.

▸ Use native instead of synthetic DNA for encoding.

▸ Readout devices such as Illumina are very expansive and bulky.

▸ Use portable and cheap nanopore sequencers such as MinION.

▸ No known interfaces between DNA-based memories and computationalplatforms.

▸ Use generalizations of strand displacement to implement Minsky’s register

machines.


Practical Obstacles







machines.


Practical Obstacles







machines.


Practical Obstacles







machines.


Practical Obstacles







machines.


Practical Obstacles







machines.


End-to-End Design


Writing on DNA

▸ File selection, (JPEG) compression, Base 64 conversion.

▸ Constrained and Error-Control Coding, Addressing.

▸ Synthesis with IDT (Integrated DNA Technologies).

JPEG

compression

Base64

conversion

Encoding into

DNA

DNA

synthesis

TTACCGCCTCCCCCCA

984 bases 16 bases

Address Information

Raw image

DNA blocks

AGTCTGAC ACTGTCTC ⋯ TGCTTGCA

JPEG and Base64 Encoded Image100010111001010…

Conversion into a sequence over the alphabet {A,T,G,C}AATGCAGCGTTTAGAGAT…

AATGCAGC|GTTTAGAG|AT…Parsing into blocks of length 896

Addressing

AATGCCCAGCATAG GTTTCGAGAGCTCTBlock GC-Content Balancing and ECC

CGAAAATGCCCAGCATAG GCTAGTTTCGAGAGCTCT


Writing on DNA




JPEG

compression

Base64

conversion

Encoding into

DNA

DNA

synthesis

TTACCGCCTCCCCCCA

984 bases 16 bases

Address Information

Raw image

DNA blocks





Addressing




Writing on DNA




JPEG

compression

Base64

conversion

Encoding into

DNA

DNA

synthesis

TTACCGCCTCCCCCCA

984 bases 16 bases

Address Information

Raw image

DNA blocks





Addressing




Why Balancing?

▸ 50% GC-content. Evens out melting temperature, creates more stable DNA.

▸ Makes synthesis simpler!


Why Balancing?

▸ 50% GC-content. Evens out melting temperature, creates more stable DNA.

▸ Makes synthesis simpler!


Why Addressing?

▸ Need to equip every block with unique address sequence to be used as PRC

primer - RA equals amplification!

▸ Addresses need to be sufficiently different (Hamming distance) and avoided

elsewhere in the blocks.

▸ Addresses should not have a Secondary Structure: Needed for accuratesequence/primer synthesis.


Why Addressing?







Why Addressing?







The Constrained Coding Components

Definition. A sequence a = (a1, . . . , an) ∈ Fnq is self uncorrelated if no proper

prefix of a matches its suffix, i.e., (a1, . . . , ai) ≠ (an−i+1, . . . , an), for all1 ≤ i < n.Extension: A mutually uncorrelated (cross-bifix-free) code is a set of sequencessuch that for any two sequences a,b ∈ Fn

q in the code no proper prefix of aappears as a suffix of b and vice versa [L70, G60, B12].


Address Sequence Construction —

Enumeration and constructions of strings of a given length that contain noelements of some fixed set of strings as subwords [GO80’s].

Take addresses as “forbidden words”: Ensures specific random access.

MU vs. Weakly MU: A k-weakly mutually uncorrelated (WMU) code is a set ofsequences such that for any two sequences a,b ∈ Fn

q in the code no properprefix of a of length ≥ k appears as a suffix of b and vice versa [TKM16].Construction of WMU codes with balancing and Hamming weight constraints?

For a = (a1, . . . , as) ,b = (b1, . . . , bs) ∈ {0,1}s, define

Ψ (a,b) ∶ {0,1}s × {0,1}s → {A,T,C,G}s

according to:

for 1 ≤ i ≤ s, ci =

⎧⎪⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎪⎩

A if (ai, bi) = (0,0)C if (ai, bi) = (0,1)T if (ai, bi) = (1,0)G if (ai, bi) = (1,1)


Address Sequence Construction ——

Decoupling the construction: Let C1,C2 ⊆ {0,1}s be two binary block code oflength s. Encode all pairs (a,b) ∈ C1 × C2 using C3 = {Ψ (a,b) ∣ a ∈ C1,b ∈ C2}.Then:

1 C3 is balanced if C2 is balanced.

2 C3 is a k-WMU code if either C1 or C2 is a k-WMU code.

3 If d1 and d2 are the minimum Hamming distances of C1 and C2,respectively, then the minimum Hamming distance of C3 is at leastmin (d1, d2).

See also [LY17] for MU codes.

Information sequence encoding?


Information Sequence Encoding?

Modification based on [WI90’s], and new approach in [TGM17].


Random Access and Rewriting Experiments

▸ Random access achieved via via PCR, addresses used as primers.

▸ Context identification and rewriting performed via gBlock or OE-PCRmethods.


Random Access and Rewriting Experiments

▸ Random access achieved via via PCR, addresses used as primers.

▸ Context identification and rewriting performed via gBlock or OE-PCRmethods.


Reading from DNA I

Reading=Sequencing: MinION Oxford Nanopore (R7).


Reading from DNA II

MinION Oxford Nanopore (R7): Sequence traces.


Reading from DNA III

Major Problem: Very large number of sequence-dependent indel andsubstitution errors (∼ 20%)!

Example Statistics:


Sequence Alignment I

First Step: “Merge traces” into one consensus sequence.

Bioinformatics: Sequence alignment [NW, SW]. Computer Science:Reconstructing sequences from traces [Batu et.al. 2004].

Error-correction

(runlength &

balancing)

Deletion

correction

BWA alignment

Phase 2

Phase 1 Read selection

based on address

MSA

Reference

Aligned reads

Region of poor alignment

Ref Reads

CNS

Recovered file

Reads


Sequence Alignment II

Optimal, but of very high complexity! Works poorly on real data!

AGTTGCGCACTTTAA…GTGCCCAACTTTTA…AATGCGGACTTTAA…AATTGGCAACTTA…

ATGCCACTTAA…

A…GTTGCGCACTTTAA…GTGCCCAACTTTTA…ATGCGGACTTTAA…ATTGGCAACTTA…

TGCCACTTAA…

AG… or AA…

▸ Reference-free sequence alignment: CLUSTAL, KALIGN, MUSCLE,TCOFFEE, etc.

▸ Simple trace reconstruction: [Batu et.al. 2004], [Holenstein et.al. 2016]


Sequence Alignment II

Optimal, but of very high complexity! Works poorly on real data!

AGTTGCGCACTTTAA…GTGCCCAACTTTTA…AATGCGGACTTTAA…AATTGGCAACTTA…

ATGCCACTTAA…

A…GTTGCGCACTTTAA…GTGCCCAACTTTTA…ATGCGGACTTTAA…ATTGGCAACTTA…

TGCCACTTAA…

AG… or AA…

▸ Reference-free sequence alignment: CLUSTAL, KALIGN, MUSCLE,TCOFFEE, etc.

▸ Simple trace reconstruction: [Batu et.al. 2004], [Holenstein et.al. 2016]


Sequence Alignment III

Nanopolish: A specialized deep-learning algorithm for base-calling developed byOxford Nanopore.

Example Statistics:


Our Readout Solution

▸ Select “best” traces for first alignment: Best traces=first sequencestraces, have a small number of errors!Nanopores are biological, tend to “tire,” traces of nonuniform quality.

▸ Perform alignment: Roughly 30 traces involved.

▸ Adjust balance: Iterative procedure. Balancing limits runlengths!

▸ Repeat while recruiting new traces.

Read selection

based on address

Error-correction

using runlength

and balancing

properties

Deletion

correction

MSA

BWA alignment

Reads

CNS

Ref

Reads Reconstructed data







Read selection

based on address

Error-correction

using runlength

and balancing

properties

Deletion

correction

MSA

BWA alignment

Reads

CNS

Ref








Read selection

based on address

Error-correction

using runlength

and balancing

properties

Deletion

correction

MSA

BWA alignment

Reads

CNS

Ref








Read selection

based on address

Error-correction

using runlength

and balancing

properties

Deletion

correction

MSA

BWA alignment

Reads

CNS

Ref



Deletion Correction Through Balancing

Cest=Current estimate of the consensus sequence

▸ Initial alignment:Cest TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace1 TTCACCCCAAAACCGAAAACCGCTTCACGATrace2 TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace3 TTCACCCAAAAACCCGAAAACCGCTTCAGCGA

▸ After MSACest T2C1A1C4A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C4A4...Trace3 T2C1A1C3A5...

▸ After BalancingCest T2C1A1C3A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C3A4...Trace3 T2C1A1C3A5...














After Balancing Deletion Correction...

Asymmetric deletion errors: Handful of errors affecting runlengths of length6 ≤ ` ≤ 8.Most errors appear in runs of the symbol A! Example Statistics:


And the Decoding Works?


Homopolymer Codes

▸ Definition. The integer sequence of a vector x ∈ Fn4 is the sequence of the

length of the runs in x.

Example: x = (0,0,1,3,3,2,1,1)→ I(x) = (2,1,2,1,2).The alternating sequence is the sequence of symbols in x, with all runsequal to one. The string sequence for x above is S(x) = (0,1,3,2,1).

▸ Suppose that CH(n, t) is a code that can correct up to t substitutionerrors. Let

C(n, t) = {x ∈ Fn4 ∶ I(x)mod 2 ∈ CH(n, t)}.

The code C(n, t) can correct up to t asymmetric (decreasing)run-preserving deletions.

▸ Related to sticky deletions [B90’s], [DA05].


Homopolymer Codes









Homopolymer Codes









Readout Time

▸ The sequencing time was ∼ 10 hours, but only “junk” reads generatedafter ∼ 6 hours.

▸ How long shall we sequence for?

▸ Definition. The k-deck of a sequence x is the multiset of all subsequencesof length k of x.

▸ Hybrid reconstruction: One is given a small number M of “long”asymmetric traces (o(n) deletions). What is the smallest value of k for ak-deck that along with the M long traces ensures unique reconstruction?

▸ See [GM17], and some general trace results by Dolecek and Yaakobi groupat ISIT!


Readout Time







Readout Time







Readout Time







Readout Time







▸ Formal mathematical theory of coding for DNA storage?

▸ Microarray Synthesis and Shotgun Sequencing: DNA ProfileCodes.

▸ Microarray Synthesis and Nanopore Sequencing: AsymmetricLee Distance and Homopolymer Codes.

▸ DNA Media Aging: Codes in the Damerau Distance.

▸ Mathematical approaches for enabling random access and rewriting.

▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and

long traces.▸ Controlled Assembly: Repeat-avoiding codes.


▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile

Codes.








Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric

Lee Distance and Homopolymer Codes.








Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.















▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.

▸ Sequence Alignment: Reconstructing sequences from short andlong traces.

▸ Controlled Assembly: Repeat-avoiding codes.





▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and

long traces.









▸ Formal mathematical theory of coding for DNA storage?

▸ Microarray Synthesis and Shotgun Sequencing: DNA ProfileCodes.








Codes.









Lee Distance and Homopolymer Codes.























▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.

▸ Sequence Alignment: Reconstructing sequences from short andlong traces.







long traces.









THANK YOU!

Documents

Portable and Random-Access DNA-Based Storage Systemstce.webee.eedev.technion.ac.il/wp-content/uploads/sites/8/2017/07/... · MotivationImplementationsEnd-to-End DesignTHANK YOU! Portable