Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Motivation Implementations End-to-End Design THANK YOU!
Portable and Random-Access DNA-Based Storage Systems
Olgica Milenkovic
A joint work with R. Gabrys, E. Garcia Ruiz, H. M. Kiah, J. Ma, G. J. Puleo, M.
Riedel, D. Soloveichik, H. M. Tabatabaei Yazdi, Y. Yuan, E. Yaakobi and H. Zhao
Henry Taub TCE Conference
June 2017
Motivation Implementations End-to-End Design THANK YOU!
Motivation
Motivation Implementations End-to-End Design THANK YOU!
DNA as Storage Media
DNA stores the “information of life” - the most massive datasource on Earth.
▸ DNA is extremely durable: Can still “read” DNA from mammoths,Neanderthals, and 700,000 old horses!
▸ DNA information content of Human cell: 6.4 GB. Mass of a cell: ∼ 3picograms. No. of cells: 15 − 40 × 1012.
▸ Can one store information in DNA?
▸ This question has been raised before: “There is plenty of room at thebottom,” R. Feynman.
Motivation Implementations End-to-End Design THANK YOU!
DNA as Storage Media
DNA stores the “information of life” - the most massive datasource on Earth.
▸ DNA is extremely durable: Can still “read” DNA from mammoths,Neanderthals, and 700,000 old horses!
▸ DNA information content of Human cell: 6.4 GB. Mass of a cell: ∼ 3picograms. No. of cells: 15 − 40 × 1012.
▸ Can one store information in DNA?
▸ This question has been raised before: “There is plenty of room at thebottom,” R. Feynman.
Motivation Implementations End-to-End Design THANK YOU!
DNA as Storage Media
DNA stores the “information of life” - the most massive datasource on Earth.
▸ DNA is extremely durable: Can still “read” DNA from mammoths,Neanderthals, and 700,000 old horses!
▸ DNA information content of Human cell: 6.4 GB. Mass of a cell: ∼ 3picograms. No. of cells: 15 − 40 × 1012.
▸ Can one store information in DNA?
▸ This question has been raised before: “There is plenty of room at thebottom,” R. Feynman.
Motivation Implementations End-to-End Design THANK YOU!
DNA as Storage Media
DNA stores the “information of life” - the most massive datasource on Earth.
▸ DNA is extremely durable: Can still “read” DNA from mammoths,Neanderthals, and 700,000 old horses!
▸ DNA information content of Human cell: 6.4 GB. Mass of a cell: ∼ 3picograms. No. of cells: 15 − 40 × 1012.
▸ Can one store information in DNA?
▸ This question has been raised before: “There is plenty of room at thebottom,” R. Feynman.
Motivation Implementations End-to-End Design THANK YOU!
We can write...
We can “write” in DNA using what is called the process of DNA Synthesis.
Biochemistry of synthesis: Stitching together bases from the set {A,T,G,C}through deprotection & coupling cycles.
Coupling
3
Capping
Oxidation
De-blocking
6
5’
3’
Final deprotection and cleavage from the support
Initial deprotection
nucleoside2-cyanoethyl phosphoramidite
Blocking unreacted 5’-OH groups
Base 2O
O
O
DMTr
O NP
CH3
CH2CH2NC
CH3
CH3
CH3
5’
3’1
Base 1O
O
O
DMTr
AAAA
5’
3’2
Base 1O
OH
O
AAAA
5’
3’4
Base 1O
OP
O
A
Base 2O
O
O
DMTr
OCH2CH2NC
AAAAA
53’
5’ Base 1O
O
O
A
O
CH3
AAAA
O
Base 1O
O
P
O
A
Base 2O
O
O
DMTr
OCH2CH2NC
AAAA
Final oligonucleotide
5’
3’7
O
Base 1O
O
P
OH
Base 2O
O
O
OCH2CH2NC
OP
O
OCH2CH2NC
O
OP
Base nO
OH
O
OCH2CH2NC
Motivation Implementations End-to-End Design THANK YOU!
We can write...
We can “write” in DNA using what is called the process of DNA Synthesis.
Biochemistry of synthesis: Stitching together bases from the set {A,T,G,C}through deprotection & coupling cycles.
Coupling
3
Capping
Oxidation
De-blocking
6
5’
3’
Final deprotection and cleavage from the support
Initial deprotection
nucleoside2-cyanoethyl phosphoramidite
Blocking unreacted 5’-OH groups
Base 2O
O
O
DMTr
O NP
CH3
CH2CH2NC
CH3
CH3
CH3
5’
3’1
Base 1O
O
O
DMTr
AAAA
5’
3’2
Base 1O
OH
O
AAAA
5’
3’4
Base 1O
OP
O
A
Base 2O
O
O
DMTr
OCH2CH2NC
AAAAA
53’
5’ Base 1O
O
O
A
O
CH3
AAAA
O
Base 1O
O
P
O
A
Base 2O
O
O
DMTr
OCH2CH2NC
AAAA
Final oligonucleotide
5’
3’7
O
Base 1O
O
P
OH
Base 2O
O
O
OCH2CH2NC
OP
O
OCH2CH2NC
O
OP
Base nO
OH
O
OCH2CH2NC
Motivation Implementations End-to-End Design THANK YOU!
We Can Write...
Commercial synthesis: Agilent, Gen9, IDT, Twist Bioscience (recent feature on“Rewriting DNA Synthesis on Silicon”).
▸ DNA microarray based short string - oligo - synthesis (left): Cost effective,large scale. Moderate error rates.
▸ Long strand - gBlock - synthesis (right): Assembles short blocks.Chemical error-correction.
▸ Types of synthesis errors: Deletions, insertions, substitutions.
Motivation Implementations End-to-End Design THANK YOU!
We Can Write...
Commercial synthesis: Agilent, Gen9, IDT, Twist Bioscience (recent feature on“Rewriting DNA Synthesis on Silicon”).
▸ DNA microarray based short string - oligo - synthesis (left): Cost effective,large scale. Moderate error rates.
▸ Long strand - gBlock - synthesis (right): Assembles short blocks.Chemical error-correction.
▸ Types of synthesis errors: Deletions, insertions, substitutions.
Motivation Implementations End-to-End Design THANK YOU!
We Can Write...
Commercial synthesis: Agilent, Gen9, IDT, Twist Bioscience (recent feature on“Rewriting DNA Synthesis on Silicon”).
▸ DNA microarray based short string - oligo - synthesis (left): Cost effective,large scale. Moderate error rates.
▸ Long strand - gBlock - synthesis (right): Assembles short blocks.Chemical error-correction.
▸ Types of synthesis errors: Deletions, insertions, substitutions.
Motivation Implementations End-to-End Design THANK YOU!
We Can Read...
Illumina (MiSeq): Best overall performance of modern sequencing technologiesin terms of yield and accuracy. Relatively small error rates (substitutions andrare deletions). Short read length.
Steps: Cloning /// Shearing /// Reading of unordered pool /// Computeraided alignment of overlapping fragments /// Consensus
Motivation Implementations End-to-End Design THANK YOU!
We Can Read...
Illumina (MiSeq): Best overall performance of modern sequencing technologiesin terms of yield and accuracy. Relatively small error rates (substitutions andrare deletions). Short read length.
Steps: Cloning /// Shearing /// Reading of unordered pool /// Computeraided alignment of overlapping fragments /// Consensus
Motivation Implementations End-to-End Design THANK YOU!
We Can Read...
Oxford Nanopore - MinIon: Longer read lengths, portable architecture.Context-dependent deletion, insertion and substitution errors.
Key properties: Biological pore(s) and motor, base calling using deep learningtechniques.
Motivation Implementations End-to-End Design THANK YOU!
We Can Read...
Oxford Nanopore - MinIon: Longer read lengths, portable architecture.Context-dependent deletion, insertion and substitution errors.
Key properties: Biological pore(s) and motor, base calling using deep learningtechniques.
Motivation Implementations End-to-End Design THANK YOU!
We Can Amplify and Enable Random Access...
Polymerase Chain Reaction (PCR): Cheap, fast, “exponential” informationreplication.
Primers - Key enablers of PCR: Short DNA strands that initiate replication at“strand-matching” locations (red blocks).
Motivation Implementations End-to-End Design THANK YOU!
We Can Amplify and Enable Random Access...
Polymerase Chain Reaction (PCR): Cheap, fast, “exponential” informationreplication.
Primers - Key enablers of PCR: Short DNA strands that initiate replication at“strand-matching” locations (red blocks).
Motivation Implementations End-to-End Design THANK YOU!
We Can Amplify and Enable Random Access...
Polymerase Chain Reaction (PCR): Cheap, fast, “exponential” informationreplication.
Primers - Key enablers of PCR: Short DNA strands that initiate replication at“strand-matching” locations (red blocks).
Motivation Implementations End-to-End Design THANK YOU!
Implementations
Motivation Implementations End-to-End Design THANK YOU!
“Double Helix Serves Double Duty”, NY Times, Jan 2013
▸ Church et al. (Science, 2012) and later Goldman et al. (Nature, 2013)stored 739 KB of data in synthetic DNA, mailed it and recreated theoriginal digital files using Illumina readers.
▸ Digital archival storage systems that will safely store the equivalent of onemillion CDs in a gram of DNA for 10,000 years.
100
D - DNA fragments
1 100 2 12
1 1
25
Encoded data
Reverse complemented encoded data
File identification
Intra-file location information
Parity-check
Reversed or non-reversed flag
A - Binary/text
B - Base-3-encoded
C - DNA-encoded
B e n a m k h o d a v a n d … …
…10001001111001010110110…
20112 20200 02110 10002 02212 01112 02110 10221 02212 11021 02212 10101 02212 10002 … …
TCACT ATATA TGTGA CGATA TAGTA TGTGC GCACG TCTAC GCTGC ACGCA TAGTA CGTCA TAGTA CGATA … …
Alternate fragments have file information reverse complemented
Motivation Implementations End-to-End Design THANK YOU!
“Double Helix Serves Double Duty”, NY Times, Jan 2013
▸ Church et al. (Science, 2012) and later Goldman et al. (Nature, 2013)stored 739 KB of data in synthetic DNA, mailed it and recreated theoriginal digital files using Illumina readers.
▸ Digital archival storage systems that will safely store the equivalent of onemillion CDs in a gram of DNA for 10,000 years.
100
D - DNA fragments
1 100 2 12
1 1
25
Encoded data
Reverse complemented encoded data
File identification
Intra-file location information
Parity-check
Reversed or non-reversed flag
A - Binary/text
B - Base-3-encoded
C - DNA-encoded
B e n a m k h o d a v a n d … …
…10001001111001010110110…
20112 20200 02110 10002 02212 01112 02110 10221 02212 11021 02212 10101 02212 10002 … …
TCACT ATATA TGTGA CGATA TAGTA TGTGC GCACG TCTAC GCTGC ACGCA TAGTA CGTCA TAGTA CGATA … …
Alternate fragments have file information reverse complemented
Motivation Implementations End-to-End Design THANK YOU!
“Data Storage on DNA Can Keep it Safe for Centuries,”NY Times, Dec 2015
Our Work: A fully operational random access and rewritable DNA-basedmemory with Sanger sequencing.
▸ Yazdi et.al., 2015 - First random access, rewritable DNA-based storagesystem. Encoded Wikipedia entries for six US universities.
▸ Large scale testing of the methods: Microsoft Research/Twist Bioscience,2016. Coverage by Spectrum, Nature, New Scientist etc.
Motivation Implementations End-to-End Design THANK YOU!
“Data Storage on DNA Can Keep it Safe for Centuries,”NY Times, Dec 2015
Our Work: A fully operational random access and rewritable DNA-basedmemory with Sanger sequencing.
▸ Yazdi et.al., 2015 - First random access, rewritable DNA-based storagesystem. Encoded Wikipedia entries for six US universities.
▸ Large scale testing of the methods: Microsoft Research/Twist Bioscience,2016. Coverage by Spectrum, Nature, New Scientist etc.
Motivation Implementations End-to-End Design THANK YOU!
Practical Obstacles
▸ Cost of synthesis 5 − 6 orders of magnitude higher than for classicalrecorders.
▸ Use native instead of synthetic DNA for encoding.
▸ Readout devices such as Illumina are very expansive and bulky.
▸ Use portable and cheap nanopore sequencers such as MinION.
▸ No known interfaces between DNA-based memories and computationalplatforms.
▸ Use generalizations of strand displacement to implement Minsky’s register
machines.
Motivation Implementations End-to-End Design THANK YOU!
Practical Obstacles
▸ Cost of synthesis 5 − 6 orders of magnitude higher than for classicalrecorders.
▸ Use native instead of synthetic DNA for encoding.
▸ Readout devices such as Illumina are very expansive and bulky.
▸ Use portable and cheap nanopore sequencers such as MinION.
▸ No known interfaces between DNA-based memories and computationalplatforms.
▸ Use generalizations of strand displacement to implement Minsky’s register
machines.
Motivation Implementations End-to-End Design THANK YOU!
Practical Obstacles
▸ Cost of synthesis 5 − 6 orders of magnitude higher than for classicalrecorders.
▸ Use native instead of synthetic DNA for encoding.
▸ Readout devices such as Illumina are very expansive and bulky.
▸ Use portable and cheap nanopore sequencers such as MinION.
▸ No known interfaces between DNA-based memories and computationalplatforms.
▸ Use generalizations of strand displacement to implement Minsky’s register
machines.
Motivation Implementations End-to-End Design THANK YOU!
Practical Obstacles
▸ Cost of synthesis 5 − 6 orders of magnitude higher than for classicalrecorders.
▸ Use native instead of synthetic DNA for encoding.
▸ Readout devices such as Illumina are very expansive and bulky.
▸ Use portable and cheap nanopore sequencers such as MinION.
▸ No known interfaces between DNA-based memories and computationalplatforms.
▸ Use generalizations of strand displacement to implement Minsky’s register
machines.
Motivation Implementations End-to-End Design THANK YOU!
Practical Obstacles
▸ Cost of synthesis 5 − 6 orders of magnitude higher than for classicalrecorders.
▸ Use native instead of synthetic DNA for encoding.
▸ Readout devices such as Illumina are very expansive and bulky.
▸ Use portable and cheap nanopore sequencers such as MinION.
▸ No known interfaces between DNA-based memories and computationalplatforms.
▸ Use generalizations of strand displacement to implement Minsky’s register
machines.
Motivation Implementations End-to-End Design THANK YOU!
Practical Obstacles
▸ Cost of synthesis 5 − 6 orders of magnitude higher than for classicalrecorders.
▸ Use native instead of synthetic DNA for encoding.
▸ Readout devices such as Illumina are very expansive and bulky.
▸ Use portable and cheap nanopore sequencers such as MinION.
▸ No known interfaces between DNA-based memories and computationalplatforms.
▸ Use generalizations of strand displacement to implement Minsky’s register
machines.
Motivation Implementations End-to-End Design THANK YOU!
End-to-End Design
Motivation Implementations End-to-End Design THANK YOU!
Writing on DNA
▸ File selection, (JPEG) compression, Base 64 conversion.
▸ Constrained and Error-Control Coding, Addressing.
▸ Synthesis with IDT (Integrated DNA Technologies).
JPEG
compression
Base64
conversion
Encoding into
DNA
DNA
synthesis
TTACCGCCTCCCCCCA
984 bases 16 bases
Address Information
Raw image
DNA blocks
AGTCTGAC ACTGTCTC ⋯ TGCTTGCA
JPEG and Base64 Encoded Image100010111001010…
Conversion into a sequence over the alphabet {A,T,G,C}AATGCAGCGTTTAGAGAT…
AATGCAGC|GTTTAGAG|AT…Parsing into blocks of length 896
Addressing
AATGCCCAGCATAG GTTTCGAGAGCTCTBlock GC-Content Balancing and ECC
CGAAAATGCCCAGCATAG GCTAGTTTCGAGAGCTCT
Motivation Implementations End-to-End Design THANK YOU!
Writing on DNA
▸ File selection, (JPEG) compression, Base 64 conversion.
▸ Constrained and Error-Control Coding, Addressing.
▸ Synthesis with IDT (Integrated DNA Technologies).
JPEG
compression
Base64
conversion
Encoding into
DNA
DNA
synthesis
TTACCGCCTCCCCCCA
984 bases 16 bases
Address Information
Raw image
DNA blocks
AGTCTGAC ACTGTCTC ⋯ TGCTTGCA
JPEG and Base64 Encoded Image100010111001010…
Conversion into a sequence over the alphabet {A,T,G,C}AATGCAGCGTTTAGAGAT…
AATGCAGC|GTTTAGAG|AT…Parsing into blocks of length 896
Addressing
AATGCCCAGCATAG GTTTCGAGAGCTCTBlock GC-Content Balancing and ECC
CGAAAATGCCCAGCATAG GCTAGTTTCGAGAGCTCT
Motivation Implementations End-to-End Design THANK YOU!
Writing on DNA
▸ File selection, (JPEG) compression, Base 64 conversion.
▸ Constrained and Error-Control Coding, Addressing.
▸ Synthesis with IDT (Integrated DNA Technologies).
JPEG
compression
Base64
conversion
Encoding into
DNA
DNA
synthesis
TTACCGCCTCCCCCCA
984 bases 16 bases
Address Information
Raw image
DNA blocks
AGTCTGAC ACTGTCTC ⋯ TGCTTGCA
JPEG and Base64 Encoded Image100010111001010…
Conversion into a sequence over the alphabet {A,T,G,C}AATGCAGCGTTTAGAGAT…
AATGCAGC|GTTTAGAG|AT…Parsing into blocks of length 896
Addressing
AATGCCCAGCATAG GTTTCGAGAGCTCTBlock GC-Content Balancing and ECC
CGAAAATGCCCAGCATAG GCTAGTTTCGAGAGCTCT
Motivation Implementations End-to-End Design THANK YOU!
Why Balancing?
▸ 50% GC-content. Evens out melting temperature, creates more stable DNA.
▸ Makes synthesis simpler!
Motivation Implementations End-to-End Design THANK YOU!
Why Balancing?
▸ 50% GC-content. Evens out melting temperature, creates more stable DNA.
▸ Makes synthesis simpler!
Motivation Implementations End-to-End Design THANK YOU!
Why Addressing?
▸ Need to equip every block with unique address sequence to be used as PRC
primer - RA equals amplification!
▸ Addresses need to be sufficiently different (Hamming distance) and avoided
elsewhere in the blocks.
▸ Addresses should not have a Secondary Structure: Needed for accuratesequence/primer synthesis.
Motivation Implementations End-to-End Design THANK YOU!
Why Addressing?
▸ Need to equip every block with unique address sequence to be used as PRC
primer - RA equals amplification!
▸ Addresses need to be sufficiently different (Hamming distance) and avoided
elsewhere in the blocks.
▸ Addresses should not have a Secondary Structure: Needed for accuratesequence/primer synthesis.
Motivation Implementations End-to-End Design THANK YOU!
Why Addressing?
▸ Need to equip every block with unique address sequence to be used as PRC
primer - RA equals amplification!
▸ Addresses need to be sufficiently different (Hamming distance) and avoided
elsewhere in the blocks.
▸ Addresses should not have a Secondary Structure: Needed for accuratesequence/primer synthesis.
Motivation Implementations End-to-End Design THANK YOU!
The Constrained Coding Components
Definition. A sequence a = (a1, . . . , an) ∈ Fnq is self uncorrelated if no proper
prefix of a matches its suffix, i.e., (a1, . . . , ai) ≠ (an−i+1, . . . , an), for all1 ≤ i < n.Extension: A mutually uncorrelated (cross-bifix-free) code is a set of sequencessuch that for any two sequences a,b ∈ Fn
q in the code no proper prefix of aappears as a suffix of b and vice versa [L70, G60, B12].
Motivation Implementations End-to-End Design THANK YOU!
Address Sequence Construction —
Enumeration and constructions of strings of a given length that contain noelements of some fixed set of strings as subwords [GO80’s].
Take addresses as “forbidden words”: Ensures specific random access.
MU vs. Weakly MU: A k-weakly mutually uncorrelated (WMU) code is a set ofsequences such that for any two sequences a,b ∈ Fn
q in the code no properprefix of a of length ≥ k appears as a suffix of b and vice versa [TKM16].Construction of WMU codes with balancing and Hamming weight constraints?
For a = (a1, . . . , as) ,b = (b1, . . . , bs) ∈ {0,1}s, define
Ψ (a,b) ∶ {0,1}s × {0,1}s → {A,T,C,G}s
according to:
for 1 ≤ i ≤ s, ci =
⎧⎪⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎪⎩
A if (ai, bi) = (0,0)C if (ai, bi) = (0,1)T if (ai, bi) = (1,0)G if (ai, bi) = (1,1)
Motivation Implementations End-to-End Design THANK YOU!
Address Sequence Construction ——
Decoupling the construction: Let C1,C2 ⊆ {0,1}s be two binary block code oflength s. Encode all pairs (a,b) ∈ C1 × C2 using C3 = {Ψ (a,b) ∣ a ∈ C1,b ∈ C2}.Then:
1 C3 is balanced if C2 is balanced.
2 C3 is a k-WMU code if either C1 or C2 is a k-WMU code.
3 If d1 and d2 are the minimum Hamming distances of C1 and C2,respectively, then the minimum Hamming distance of C3 is at leastmin (d1, d2).
See also [LY17] for MU codes.
Information sequence encoding?
Motivation Implementations End-to-End Design THANK YOU!
Information Sequence Encoding?
Modification based on [WI90’s], and new approach in [TGM17].
Motivation Implementations End-to-End Design THANK YOU!
Random Access and Rewriting Experiments
▸ Random access achieved via via PCR, addresses used as primers.
▸ Context identification and rewriting performed via gBlock or OE-PCRmethods.
Motivation Implementations End-to-End Design THANK YOU!
Random Access and Rewriting Experiments
▸ Random access achieved via via PCR, addresses used as primers.
▸ Context identification and rewriting performed via gBlock or OE-PCRmethods.
Motivation Implementations End-to-End Design THANK YOU!
Reading from DNA I
Reading=Sequencing: MinION Oxford Nanopore (R7).
Motivation Implementations End-to-End Design THANK YOU!
Reading from DNA II
MinION Oxford Nanopore (R7): Sequence traces.
Motivation Implementations End-to-End Design THANK YOU!
Reading from DNA III
Major Problem: Very large number of sequence-dependent indel andsubstitution errors (∼ 20%)!
Example Statistics:
Motivation Implementations End-to-End Design THANK YOU!
Sequence Alignment I
First Step: “Merge traces” into one consensus sequence.
Bioinformatics: Sequence alignment [NW, SW]. Computer Science:Reconstructing sequences from traces [Batu et.al. 2004].
Error-correction
(runlength &
balancing)
Deletion
correction
BWA alignment
Phase 2
Phase 1 Read selection
based on address
MSA
Reference
Aligned reads
Region of poor alignment
Ref Reads
CNS
Recovered file
Reads
Motivation Implementations End-to-End Design THANK YOU!
Sequence Alignment II
Optimal, but of very high complexity! Works poorly on real data!
AGTTGCGCACTTTAA…GTGCCCAACTTTTA…AATGCGGACTTTAA…AATTGGCAACTTA…
ATGCCACTTAA…
A…GTTGCGCACTTTAA…GTGCCCAACTTTTA…ATGCGGACTTTAA…ATTGGCAACTTA…
TGCCACTTAA…
AG… or AA…
▸ Reference-free sequence alignment: CLUSTAL, KALIGN, MUSCLE,TCOFFEE, etc.
▸ Simple trace reconstruction: [Batu et.al. 2004], [Holenstein et.al. 2016]
Motivation Implementations End-to-End Design THANK YOU!
Sequence Alignment II
Optimal, but of very high complexity! Works poorly on real data!
AGTTGCGCACTTTAA…GTGCCCAACTTTTA…AATGCGGACTTTAA…AATTGGCAACTTA…
ATGCCACTTAA…
A…GTTGCGCACTTTAA…GTGCCCAACTTTTA…ATGCGGACTTTAA…ATTGGCAACTTA…
TGCCACTTAA…
AG… or AA…
▸ Reference-free sequence alignment: CLUSTAL, KALIGN, MUSCLE,TCOFFEE, etc.
▸ Simple trace reconstruction: [Batu et.al. 2004], [Holenstein et.al. 2016]
Motivation Implementations End-to-End Design THANK YOU!
Sequence Alignment III
Nanopolish: A specialized deep-learning algorithm for base-calling developed byOxford Nanopore.
Example Statistics:
Motivation Implementations End-to-End Design THANK YOU!
Our Readout Solution
▸ Select “best” traces for first alignment: Best traces=first sequencestraces, have a small number of errors!Nanopores are biological, tend to “tire,” traces of nonuniform quality.
▸ Perform alignment: Roughly 30 traces involved.
▸ Adjust balance: Iterative procedure. Balancing limits runlengths!
▸ Repeat while recruiting new traces.
Read selection
based on address
Error-correction
using runlength
and balancing
properties
Deletion
correction
MSA
BWA alignment
Reads
CNS
Ref
Reads Reconstructed data
Motivation Implementations End-to-End Design THANK YOU!
Our Readout Solution
▸ Select “best” traces for first alignment: Best traces=first sequencestraces, have a small number of errors!Nanopores are biological, tend to “tire,” traces of nonuniform quality.
▸ Perform alignment: Roughly 30 traces involved.
▸ Adjust balance: Iterative procedure. Balancing limits runlengths!
▸ Repeat while recruiting new traces.
Read selection
based on address
Error-correction
using runlength
and balancing
properties
Deletion
correction
MSA
BWA alignment
Reads
CNS
Ref
Reads Reconstructed data
Motivation Implementations End-to-End Design THANK YOU!
Our Readout Solution
▸ Select “best” traces for first alignment: Best traces=first sequencestraces, have a small number of errors!Nanopores are biological, tend to “tire,” traces of nonuniform quality.
▸ Perform alignment: Roughly 30 traces involved.
▸ Adjust balance: Iterative procedure. Balancing limits runlengths!
▸ Repeat while recruiting new traces.
Read selection
based on address
Error-correction
using runlength
and balancing
properties
Deletion
correction
MSA
BWA alignment
Reads
CNS
Ref
Reads Reconstructed data
Motivation Implementations End-to-End Design THANK YOU!
Our Readout Solution
▸ Select “best” traces for first alignment: Best traces=first sequencestraces, have a small number of errors!Nanopores are biological, tend to “tire,” traces of nonuniform quality.
▸ Perform alignment: Roughly 30 traces involved.
▸ Adjust balance: Iterative procedure. Balancing limits runlengths!
▸ Repeat while recruiting new traces.
Read selection
based on address
Error-correction
using runlength
and balancing
properties
Deletion
correction
MSA
BWA alignment
Reads
CNS
Ref
Reads Reconstructed data
Motivation Implementations End-to-End Design THANK YOU!
Deletion Correction Through Balancing
Cest=Current estimate of the consensus sequence
▸ Initial alignment:Cest TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace1 TTCACCCCAAAACCGAAAACCGCTTCACGATrace2 TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace3 TTCACCCAAAAACCCGAAAACCGCTTCAGCGA
▸ After MSACest T2C1A1C4A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C4A4...Trace3 T2C1A1C3A5...
▸ After BalancingCest T2C1A1C3A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C3A4...Trace3 T2C1A1C3A5...
Motivation Implementations End-to-End Design THANK YOU!
Deletion Correction Through Balancing
Cest=Current estimate of the consensus sequence
▸ Initial alignment:Cest TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace1 TTCACCCCAAAACCGAAAACCGCTTCACGATrace2 TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace3 TTCACCCAAAAACCCGAAAACCGCTTCAGCGA
▸ After MSACest T2C1A1C4A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C4A4...Trace3 T2C1A1C3A5...
▸ After BalancingCest T2C1A1C3A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C3A4...Trace3 T2C1A1C3A5...
Motivation Implementations End-to-End Design THANK YOU!
Deletion Correction Through Balancing
Cest=Current estimate of the consensus sequence
▸ Initial alignment:Cest TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace1 TTCACCCCAAAACCGAAAACCGCTTCACGATrace2 TTCACCCAAAAACCCGAAAACCGCTTCAGCGATrace3 TTCACCCAAAAACCCGAAAACCGCTTCAGCGA
▸ After MSACest T2C1A1C4A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C4A4...Trace3 T2C1A1C3A5...
▸ After BalancingCest T2C1A1C3A4...Trace1 T2C1A1C3A4...Trace2 T2C1A1C3A4...Trace3 T2C1A1C3A5...
Motivation Implementations End-to-End Design THANK YOU!
After Balancing Deletion Correction...
Asymmetric deletion errors: Handful of errors affecting runlengths of length6 ≤ ` ≤ 8.Most errors appear in runs of the symbol A! Example Statistics:
Motivation Implementations End-to-End Design THANK YOU!
And the Decoding Works?
Motivation Implementations End-to-End Design THANK YOU!
Homopolymer Codes
▸ Definition. The integer sequence of a vector x ∈ Fn4 is the sequence of the
length of the runs in x.
Example: x = (0,0,1,3,3,2,1,1)→ I(x) = (2,1,2,1,2).The alternating sequence is the sequence of symbols in x, with all runsequal to one. The string sequence for x above is S(x) = (0,1,3,2,1).
▸ Suppose that CH(n, t) is a code that can correct up to t substitutionerrors. Let
C(n, t) = {x ∈ Fn4 ∶ I(x)mod 2 ∈ CH(n, t)}.
The code C(n, t) can correct up to t asymmetric (decreasing)run-preserving deletions.
▸ Related to sticky deletions [B90’s], [DA05].
Motivation Implementations End-to-End Design THANK YOU!
Homopolymer Codes
▸ Definition. The integer sequence of a vector x ∈ Fn4 is the sequence of the
length of the runs in x.
Example: x = (0,0,1,3,3,2,1,1)→ I(x) = (2,1,2,1,2).The alternating sequence is the sequence of symbols in x, with all runsequal to one. The string sequence for x above is S(x) = (0,1,3,2,1).
▸ Suppose that CH(n, t) is a code that can correct up to t substitutionerrors. Let
C(n, t) = {x ∈ Fn4 ∶ I(x)mod 2 ∈ CH(n, t)}.
The code C(n, t) can correct up to t asymmetric (decreasing)run-preserving deletions.
▸ Related to sticky deletions [B90’s], [DA05].
Motivation Implementations End-to-End Design THANK YOU!
Homopolymer Codes
▸ Definition. The integer sequence of a vector x ∈ Fn4 is the sequence of the
length of the runs in x.
Example: x = (0,0,1,3,3,2,1,1)→ I(x) = (2,1,2,1,2).The alternating sequence is the sequence of symbols in x, with all runsequal to one. The string sequence for x above is S(x) = (0,1,3,2,1).
▸ Suppose that CH(n, t) is a code that can correct up to t substitutionerrors. Let
C(n, t) = {x ∈ Fn4 ∶ I(x)mod 2 ∈ CH(n, t)}.
The code C(n, t) can correct up to t asymmetric (decreasing)run-preserving deletions.
▸ Related to sticky deletions [B90’s], [DA05].
Motivation Implementations End-to-End Design THANK YOU!
Readout Time
▸ The sequencing time was ∼ 10 hours, but only “junk” reads generatedafter ∼ 6 hours.
▸ How long shall we sequence for?
▸ Definition. The k-deck of a sequence x is the multiset of all subsequencesof length k of x.
▸ Hybrid reconstruction: One is given a small number M of “long”asymmetric traces (o(n) deletions). What is the smallest value of k for ak-deck that along with the M long traces ensures unique reconstruction?
▸ See [GM17], and some general trace results by Dolecek and Yaakobi groupat ISIT!
Motivation Implementations End-to-End Design THANK YOU!
Readout Time
▸ The sequencing time was ∼ 10 hours, but only “junk” reads generatedafter ∼ 6 hours.
▸ How long shall we sequence for?
▸ Definition. The k-deck of a sequence x is the multiset of all subsequencesof length k of x.
▸ Hybrid reconstruction: One is given a small number M of “long”asymmetric traces (o(n) deletions). What is the smallest value of k for ak-deck that along with the M long traces ensures unique reconstruction?
▸ See [GM17], and some general trace results by Dolecek and Yaakobi groupat ISIT!
Motivation Implementations End-to-End Design THANK YOU!
Readout Time
▸ The sequencing time was ∼ 10 hours, but only “junk” reads generatedafter ∼ 6 hours.
▸ How long shall we sequence for?
▸ Definition. The k-deck of a sequence x is the multiset of all subsequencesof length k of x.
▸ Hybrid reconstruction: One is given a small number M of “long”asymmetric traces (o(n) deletions). What is the smallest value of k for ak-deck that along with the M long traces ensures unique reconstruction?
▸ See [GM17], and some general trace results by Dolecek and Yaakobi groupat ISIT!
Motivation Implementations End-to-End Design THANK YOU!
Readout Time
▸ The sequencing time was ∼ 10 hours, but only “junk” reads generatedafter ∼ 6 hours.
▸ How long shall we sequence for?
▸ Definition. The k-deck of a sequence x is the multiset of all subsequencesof length k of x.
▸ Hybrid reconstruction: One is given a small number M of “long”asymmetric traces (o(n) deletions). What is the smallest value of k for ak-deck that along with the M long traces ensures unique reconstruction?
▸ See [GM17], and some general trace results by Dolecek and Yaakobi groupat ISIT!
Motivation Implementations End-to-End Design THANK YOU!
Readout Time
▸ The sequencing time was ∼ 10 hours, but only “junk” reads generatedafter ∼ 6 hours.
▸ How long shall we sequence for?
▸ Definition. The k-deck of a sequence x is the multiset of all subsequencesof length k of x.
▸ Hybrid reconstruction: One is given a small number M of “long”asymmetric traces (o(n) deletions). What is the smallest value of k for ak-deck that along with the M long traces ensures unique reconstruction?
▸ See [GM17], and some general trace results by Dolecek and Yaakobi groupat ISIT!
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?
▸ Microarray Synthesis and Shotgun Sequencing: DNA ProfileCodes.
▸ Microarray Synthesis and Nanopore Sequencing: AsymmetricLee Distance and Homopolymer Codes.
▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.
▸ Microarray Synthesis and Nanopore Sequencing: AsymmetricLee Distance and Homopolymer Codes.
▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.
▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.
▸ Sequence Alignment: Reconstructing sequences from short andlong traces.
▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.
▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?
▸ Microarray Synthesis and Shotgun Sequencing: DNA ProfileCodes.
▸ Microarray Synthesis and Nanopore Sequencing: AsymmetricLee Distance and Homopolymer Codes.
▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.
▸ Microarray Synthesis and Nanopore Sequencing: AsymmetricLee Distance and Homopolymer Codes.
▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.
▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.
▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.
▸ Sequence Alignment: Reconstructing sequences from short andlong traces.
▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.
▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
▸ Formal mathematical theory of coding for DNA storage?▸ Microarray Synthesis and Shotgun Sequencing: DNA Profile
Codes.▸ Microarray Synthesis and Nanopore Sequencing: Asymmetric
Lee Distance and Homopolymer Codes.▸ DNA Media Aging: Codes in the Damerau Distance.
▸ Mathematical approaches for enabling random access and rewriting.▸ Address Design: (Weakly) Mutually Uncorrelated Codes.▸ Sequence Alignment: Reconstructing sequences from short and
long traces.▸ Controlled Assembly: Repeat-avoiding codes.
Motivation Implementations End-to-End Design THANK YOU!
THANK YOU!